All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
Authors: D Testa, G Bonetta, R Bernardi, A Bondielli
Published: 2025
Publication: arXiv preprint arXiv ..., 2025
MAIA is a benchmark designed to evaluate the reasoning abilities of Vision Language Models (VLMs) on video-based tasks, with a focus on Italian culture and language, revealing their fragility in consistency and visually grounded language comprehension and generation.
Methods: MAIA comprises a set of video-related questions, categorized into twelve reasoning types to disentangle language-vision relations, and tests them through two tasks: visual statement verification and open-ended visual question answering.
Key Findings: Current Vision Language Models (VLMs) are fragile: they struggle to perform consistent, visually grounded natural language understanding and generation across the fine-grained reasoning categories.
Limitations: The benchmark highlights current model fragility but does not address broader generalization to non-Italian contexts or other cultural landscapes.
Institution: Università di Roma La Sapienza
Research Area: Multimodal Reasoning, AI Benchmarking
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2502.16989