Visual cognition in multimodal large language models
Authors: LM Schulze Buschoff, E Akata, M Bethge, E Schulz
Published: 2025
Publication: Nature Machine Intelligence
Summary: Vision-capable large language models are proficient at interpreting visual data but fall short of human-like abilities in intuitive physics, causal reasoning, and social cognition.
Methods: Controlled experiments evaluating model performance on vision-based tasks probing intuitive physics, causal reasoning, and intuitive psychology (a minimal sketch of such an evaluation loop follows this record).
Key Findings: Models capture some aspects of physical interactions, causal relationships, and social preferences, but their performance remains below human level across these domains.
Limitations: Models lack robust mechanisms for reasoning about causality, physical dynamics, and social situations, and so do not match human cognitive abilities in these domains.
Institution: Max Planck Institute for Biological Cybernetics
Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)
Discipline: Cognitive Science, Artificial Intelligence, Computer Vision
Citations: 70
DOI: https://doi.org/10.1038/s42256-024-00963-y
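
Illustrative sketch (not from the paper): the Methods field describes controlled experiments that show a model a visual stimulus, ask a question, and score the answer against ground truth. Below is a minimal Python sketch of that kind of evaluation loop. The Trial fields, the block-tower stimuli, and the query_model stub are assumptions for illustration, not the authors' actual tasks or code; a real study would replace query_model with a call to the multimodal model under test.

from dataclasses import dataclass

@dataclass
class Trial:
    image_path: str    # stimulus image shown to the model
    question: str      # task prompt, e.g. "Will this block tower fall?"
    ground_truth: str  # expected answer used for scoring

def query_model(image_path: str, question: str) -> str:
    # Placeholder: a real experiment would send the image and question to
    # a multimodal model API and return its text answer. Stubbed with a
    # fixed answer so the sketch runs end to end.
    return "no"

def evaluate(trials: list[Trial]) -> float:
    # Fraction of trials answered correctly (exact-match scoring; real
    # studies typically parse model answers more carefully).
    correct = sum(
        query_model(t.image_path, t.question).strip().lower() == t.ground_truth.lower()
        for t in trials
    )
    return correct / len(trials)

# Hypothetical two-trial intuitive-physics task (block-tower stability).
trials = [
    Trial("tower_stable.png", "Will this block tower fall? Answer yes or no.", "no"),
    Trial("tower_unstable.png", "Will this block tower fall? Answer yes or no.", "yes"),
]
print(f"accuracy: {evaluate(trials):.2f}")  # -> accuracy: 0.50 with the stub

Controlled experiments of this shape let human and model accuracy be compared trial for trial on the same stimuli, which is how the paper's human-versus-model gaps are established.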