This page lists 13 peer-reviewed papers in the research area of Multimodal in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: LM Schulze Buschoff, E Akata, M Bethge
Year: 2025
Published in: Nature Machine Intelligence, 2025
Institution: Max Planck Institute
Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)
Discipline: Cognitive Science, Artificial Intelligence, Computer Vision
Vision-based large language models show proficiency in interpreting visual data but fall short of human-like abilities in causal reasoning, intuitive physics, and social cognition.
Methods: Controlled experiments evaluating model performance on tasks related to intuitive physics, causal reasoning, and intuitive psychology using visual processing benchmarks.
Key Findings: Model capabilities in understanding physical interactions, causal relationships, and social preferences.
DOI: https://doi.org/10.1038/s42256-024-00963-y
Citations: 70
-
Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger
Year: 2025
Published in: arXiv preprint arXiv:2502.07077, 2025
Institution: Google DeepMind, Google, University of Oxford
Research Area: Multimodal conversational AI, conversational AI, Evaluation methodology, benchmarking
Discipline: Computer Science, Natural Language Processing (NLP), Human–Computer Interaction (HCI)
The paper evaluates anthropomorphic behaviors in state-of-the-art LLMs through a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.
Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study (a simplified single-cue probe is sketched below).
Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.
Citations: 26
Sample Size: 1101
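Illustrative sketch: the entry above describes turn-level scoring of anthropomorphic behaviors across simulated conversations. A minimal probe for one simplified cue (first-person pronoun usage, one of the behaviors the paper tracks) might look like the following; the transcript and the single regex cue are hypothetical stand-ins for the paper's simulated users and 14 behavior classifiers.

```python
# Minimal sketch of a multi-turn behavior probe. The transcript is a canned,
# hypothetical conversation; the regex is one simplified cue, not the paper's
# classifier suite.
import re

transcript = [  # assistant turns from one hypothetical simulated conversation
    "Sure, here is the summary you asked for.",
    "I enjoyed working through that with you!",
    "I feel like we make a good team.",
]

FIRST_PERSON = re.compile(r"\b(I|me|my|mine|we|us|our)\b")

# Flag each turn exhibiting the cue and report when it first emerges,
# mirroring the finding that such behaviors surface after multiple turns.
flags = [bool(FIRST_PERSON.search(turn)) for turn in transcript]
first_turn = flags.index(True) + 1 if any(flags) else None
print(f"turns flagged: {sum(flags)}/{len(flags)}, first at turn {first_turn}")
```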
-
Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood
Year: 2025
Published in: arXiv preprint arXiv:2507.13383, 2025
Institution: Google DeepMind, Google Research, Google
Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human–AI interaction, LLM
Discipline: Computer Science, Machine Learning, Artificial Intelligence
This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image (T2I) models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I alignment strategies.
Methods: Feedback was collected across 1,000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception (a toy aggregation is sketched below).
Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.
Citations: 1
Sample Size: 1000
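Illustrative sketch: the core analysis pattern here, slicing harm ratings by demographic cohort to expose divergent safety perceptions, can be shown with a toy pandas table. The column names and values are hypothetical, not DIVE's actual schema.

```python
# Toy aggregation of per-rater harm judgments by demographic cohort; the
# schema below is illustrative, not the DIVE dataset's.
import pandas as pd

ratings = pd.DataFrame({
    "prompt_id": [0, 0, 1, 1, 1, 2],
    "cohort":    ["A", "B", "A", "B", "B", "A"],  # intersectional group label
    "harmful":   [1, 0, 0, 1, 1, 0],              # 1 = rated the output harmful
})

# Demographic variation in harm perception appears as divergent cohort means.
print(ratings.groupby("cohort")["harmful"].mean())
```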
-
Authors: D Testa, G Bonetta, R Bernardi, A Bondielli
Year: 2025
Published in: arXiv preprint arXiv:2502.16989, 2025
Institution: Università di Roma La Sapienza
Research Area: Multimodal Reasoning, AI Benchmarking
Discipline: Artificial Intelligence
MAIA is a benchmark for evaluating the reasoning abilities of Vision Language Models (VLMs) on video-based tasks, with a focus on Italian culture and language; it reveals their fragility in maintaining consistency and in visually grounded language comprehension and generation.
Methods: MAIA comprises a set of video-related questions tested with two tasks: visual statement verification and open-ended visual question answering, categorized into twelve reasoning types to disentangle language-vision relations.
Key Findings: The ability of Vision Language Models (VLMs) to perform consistent, visually grounded natural language understanding and generation across fine-grained reasoning categories.
DOI: https://doi.org/10.48550/arXiv.2502.16989
-
Authors: T Davidson
Year: 2025
Published in: Nature Human Behaviour, 2025
Institution: University of Oxford, Davidson College
Research Area: Hate Speech Evaluation, Multimodal LLMs, Social Bias, Computational Law, AI Bias, AI Evaluation
Discipline: Artificial Intelligence
The study demonstrates that larger multimodal large language models (MLLMs) can align closely with human judgement in context-sensitive hate speech evaluations, though they still exhibit biases and limitations.
Methods: Conjoint experiments in which simulated social media posts varying in attributes such as slur usage and user demographics were evaluated by MLLMs and compared to human judgements (a toy attribute grid is sketched below).
Key Findings: The capacity of MLLMs to evaluate hate speech in a context-sensitive manner and their alignment with human judgement, while assessing biases and responsiveness to contextual cues.
Sample Size: 1854
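Illustrative sketch: a conjoint design fully crosses post attributes so each attribute's effect on model (or human) judgements can be isolated. The attribute names and levels below are hypothetical, not the study's actual design.

```python
# Toy conjoint-style stimulus grid: fully cross post attributes to enumerate
# simulated post profiles. Attributes and levels are illustrative only.
from itertools import product

attributes = {
    "slur": ["present", "absent"],
    "target_group": ["group_x", "group_y"],
    "author_demographic": ["in-group", "out-group"],
}

profiles = [dict(zip(attributes, combo)) for combo in product(*attributes.values())]
for p in profiles[:3]:
    print(p)
print(f"{len(profiles)} simulated post profiles in total")
```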
-
Authors: B Grimm, P Yilmam, B Talbot, L Larsen
Year: 2025
Published in: npj Digital Medicine, 2025
Institution: Videra Health
Research Area: Computational Mental Health Assessment, Multimodal Machine Learning
Discipline: Computational Health, Digital Medicine
A multimodal machine learning model using text (MPNet) and voice (HuBERT) analysis predicts depression, anxiety, and trauma from a single video-based question with strong performance and demographic consistency while significantly reducing assessment time.
Methods: Multimodal analysis combining MPNet for textual data and HuBERT for prosodic voice features, trained on video-based responses (a late-fusion sketch follows this entry).
Key Findings: Efficient prediction of self-reported scores for depression (PHQ-9), anxiety (GAD-7), and trauma (PCL-5) from brief video responses.
Sample Size: 2420
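Illustrative sketch: the late-fusion pattern described above (concatenate per-modality embeddings, then regress onto a clinical score) can be outlined with random stand-in features; a real pipeline would substitute MPNet sentence embeddings and pooled HuBERT features, and the regressor choice here (ridge regression) is an assumption, not the paper's.

```python
# Late-fusion multimodal regression sketch. Embeddings and targets are random
# stand-ins, so the printed R^2 is meaningless; it only shows the plumbing.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500                                  # hypothetical number of video responses
text_emb = rng.normal(size=(n, 768))     # stand-in for MPNet sentence embeddings
voice_emb = rng.normal(size=(n, 768))    # stand-in for pooled HuBERT features
phq9 = rng.integers(0, 28, size=n).astype(float)  # PHQ-9 totals range 0-27

X = np.concatenate([text_emb, voice_emb], axis=1)  # late fusion by concatenation
X_tr, X_te, y_tr, y_te = train_test_split(X, phq9, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2:", r2_score(y_te, model.predict(X_te)))
```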
-
Authors: M Ku, T Li, K Zhang, Y Lu, X Fu, W Zhuang
Year: 2024
Published in: arXiv preprint arXiv:2310.01596, 2023
Institution: University of Waterloo, Ohio State University, University of California Santa Barbara, University of Pennsylvania
Research Area: AI alignment, Representation learning, Cognitive computational modeling, Vision foundation models evaluation, Multimodal, Vision models
Discipline: Computer Science, Artificial Intelligence, Machine Learning
This paper presents a method for aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction (a generic alignment check is sketched below).
DOI: https://doi.org/10.48550/arXiv.2310.01596
Citations: 59
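Illustrative sketch: a standard way to quantify representation-human alignment, not necessarily this paper's exact procedure, is to correlate pairwise model similarities with pairwise human similarity judgments. The embeddings and ratings below are random stand-ins.

```python
# Generic representation-alignment check: Spearman correlation between model
# cosine similarities and human similarity judgments over all image pairs.
# All data here are random stand-ins.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
emb = rng.normal(size=(20, 64))            # stand-in image embeddings
human = rng.uniform(size=(20, 20))         # stand-in pairwise human similarity
human = (human + human.T) / 2              # symmetrize the judgment matrix

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = list(combinations(range(len(emb)), 2))
model_sims = [cosine(emb[i], emb[j]) for i, j in pairs]
human_sims = [human[i, j] for i, j in pairs]
rho, _ = spearmanr(model_sims, human_sims)
print("model-human similarity correlation:", round(float(rho), 3))
```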
-
Authors: E Watson, T Viana, S Zhang
Year: 2024
Published in: AI (MDPI), 2023
Research Area: Behavioral Annotation Tools and Multimodal Data
Discipline: Computer Science
The paper systematically reviews augmented behavioral annotation tools, focusing on their evolution, current state, and application to multimodal datasets and models, highlighting best practices and emerging challenges in safe and ethical annotation for large-scale multimodal systems.
Methods: Systematic literature review analyzing crowd and machine learning-augmented behavioral annotation methods, with cross-disciplinary comparisons and structured synthesis of practices.
Key Findings: Evolution of behavioral annotation tools, their integration with machine learning, emerging trends (e.g., prompt engineering), challenges in large multimodal datasets, and ethical and engineering best practices.
DOI: https://doi.org/10.3390/ai4010007
Citations: 17
-
Authors: T Davidson
Year: 2024
Published in: OSF preprint, 2024
Institution: University of Cambridge
Research Area: Content Moderation, Multimodal LLM Auditing, Computational Social Science
Discipline: Computational Social Science
Citations: 2
-
Authors: V Kewenig, C Edwards
Year: 2024
Published in: SSRN preprint, 2023
Research Area: Multimodal AI, Cognitive Science, Visual-Linguistic Integration
Discipline: Artificial Intelligence, Computational Linguistics, Cognitive Science
Citations: 2
-
Authors: Z Qiu, W Liu, H Feng, Z Liu, T Xiao
Year: 2024
Published in: arXiv preprint
Institution: Massachusetts Institute of Technology, Max Planck Institute, University of Cambridge
Research Area: Computational cognition, LLM evaluation, Program synthesis, Multimodal reasoning
Discipline: Artificial Intelligence
Introduces SGP-Bench, a benchmark testing whether LLMs can answer semantic and spatial questions about images purely from graphics programs (SVG/CAD), effectively probing “visual imagination without vision.” The authors show that current LLMs struggle, sometimes performing near chance even on images that are trivial for humans, but demonstrate that Symbolic Instruction Tuning (SIT) can meaningfully improve this capability.
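Illustrative sketch: an SGP-Bench-style item gives a text-only LLM the source of a graphics program and asks a question that would be trivial with the rendered image. The SVG and question below are hypothetical examples, not benchmark items; the resulting prompt would be passed to whatever chat-completion API you use.

```python
# Toy SGP-Bench-style probe: a spatial question answerable only by reasoning
# over SVG source, since the model never sees the rendered image.
svg_program = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="50" cy="40" r="20" fill="yellow"/>
  <rect x="35" y="70" width="30" height="20" fill="brown"/>
</svg>"""

question = "Which shape sits higher in the image, the circle or the rectangle?"

prompt = (
    "You are given the source code of an SVG image. Answer the question "
    "using only the code; you cannot render it.\n\n"
    f"{svg_program}\n\nQuestion: {question}"
)

# Pass `prompt` to any chat LLM; a correct answer (the circle, since SVG y
# increases downward) requires spatial reasoning over code alone.
print(prompt)
```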
-
Authors: D Testa, G Bonetta, R Bernardi
Year: 2024
Published in: Proceedings of the ..., 2025 (ACL Anthology)
Institution: Università di Roma La Sapienza, Fondazione Bruno Kessler, University of Pisa
Research Area: Multimodal AI Assessment, Visual Language Models (VLMs), Video Understanding, Computational Linguistics
Discipline: Artificial Intelligence, Computational Linguistics
-
Authors: Y Zhang, Z Li, M Zhou, S Wu, J Wu
Year: 2024
Published in: arXiv preprint
Institution: Stanford University, University of California Berkeley
Research Area: Artificial Intelligence, Computer Vision, Multimodal Reasoning
Discipline: Artificial Intelligence