Benchmark Studies

This page lists 27 peer-reviewed papers tagged with Benchmark in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (20 of 27)

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Authors: N Petrova, A Gordon, E Blindow

Year: 2026

Published in: OpenReview

Institution: Prolific

Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation

Discipline: Machine Learning, Artificial Intelligence

The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.

Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.

Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.

Sample Size: 23404
Visual cognition in multimodal large language models

Authors: LM Schulze Buschoff, E Akata, M Bethge

Year: 2025

Published in: Nature Machine ..., 2025 - nature.com

Institution: Max Planck Institute

Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)

Discipline: Cognitive Science, Artificial Intelligence, Computer Vision

Vision-based large language models show proficiency in visual data interpretation but fall short in human-like abilities for causal reasoning, intuitive physics, and social cognition.

Methods: Controlled experiments evaluating model performance on tasks related to intuitive physics, causal reasoning, and intuitive psychology using visual processing benchmarks.

Key Findings: Model capabilities in understanding physical interactions, causal relationships, and social preferences.

DOI: https://doi.org/10.1038/s42256-024-00963-y

Citations: 70
Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger

Year: 2025

Published in: arXiv preprint arXiv:2502.07077, 2025 - arxiv.org

Institution: Google DeepMind, Google, University of Oxford

Research Area: Multimodal conversational AI, conversational AI, Evaluation methodology, benchmarking

Discipline: Computer Science, Natural Language Processing, Human-Computer Interaction

The paper evaluates anthropomorphic behaviors in SOTA LLMs through a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.

Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.

Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.

Citations: 26

Sample Size: 1101
To rely or not to rely? Evaluating interventions for appropriate reliance on large language models

Authors: JY Bo, S Wan, A Anderson

Year: 2025

Published in: Proceedings of the 2025 CHI Conference ..., 2025 - dl.acm.org

Institution: University of Toronto

Research Area: Appropriate reliance on LLMs, Human-Computer Interaction, AI-assisted decision making

Discipline: Human-Computer Interaction

This paper evaluates interventions designed to promote appropriate reliance on large language models.

Citations: 25
Impact of AI-Assisted Diagnosis on American Patients' Trust in and Intention to Seek Help From Health Care Professionals: Randomized, Web-Based Survey ...

Authors: C Chen, Z Cui

Year: 2025

Published in: Journal of Medical Internet Research, 2025 - jmir.org

Institution: Medical College of Wisconsin

Research Area: Trust in AI, AI-assisted diagnosis, Health communication, Healthcare human-AI interaction

Discipline: Digital Health, Human-Computer Interaction, Behavioral Science

Patients place more trust in, and are more likely to seek help from, doctors who explicitly avoid AI-assisted diagnosis than doctors who use AI extensively or moderately, highlighting a strong aversion to AI in healthcare settings.

Methods: A randomized, web-based 4-group survey experiment was conducted with controls for sociodemographic factors and analysis using regression, mediation, and moderation techniques.

Key Findings: Trust in and intention to seek medical help from health care professionals using AI-assisted diagnosis versus those avoiding AI, and the influence of demographic, social, and experiential factors.

DOI: https://doi.org/10.2196/66083

Citations: 4

Sample Size: 1762
Benchmarking World-Model Learning

Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares

Year: 2025

Published in: arXiv preprint arXiv …, 2025 - arxiv.org

Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University

Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, Reinforcement Learning from Human Feedback (RLHF), Large Language Models

Discipline: Computer Science, Artificial Intelligence, Machine Learning

The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.

Methods: The authors proposed WorldTest, a protocol separating reward-free interaction from scored tests in related environments, with evaluations done using AutumnBench—a dataset of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics.

Key Findings: Performance of model-learning agents and humans in acquiring world models for masked-frame prediction, planning, and understanding causal dynamics.

Citations: 1

Sample Size: 517
To Mask or to Mirror: Human-AI Alignment in Collective Reasoning

Authors: C Qian, AT Parisi, C Bouleau, V Tsai

Year: 2025

Published in: Proceedings of the ..., 2025 - aclanthology.org

Institution: Google, Google DeepMind

Research Area: Human-AI Alignment, Collective Reasoning, Social Biases, LLM Simulation of Human Behavior, AI Bias

Discipline: Natural Language Processing, Artificial Intelligence, Computational Social Science

This study examines human-AI alignment in collective reasoning using an empirical framework, demonstrating how LLMs either mirror or mask human biases depending on context, cues, and model-specific inductive biases.

Methods: The study uses the Lost at Sea social psychology task in a large-scale online experiment, simulating LLM groups conditioned on human decision-making data across varying conditions of visible or pseudonymous demographics.

Key Findings: Alignment of LLM behavior with human social reasoning, focusing on collective decision-making and biases in group interactions.

Citations: 1

Sample Size: 748
All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark

Authors: D Testa, G Bonetta, R Bernardi, A Bondielli

Year: 2025

Published in: arXiv preprint arXiv ..., 2025 - arxiv.org

Institution: Università di Roma La Sapienza

Research Area: Multimodal Reasoning, AI Benchmarking

Discipline: Artificial Intelligence

MAIA is a benchmark designed to evaluate the reasoning abilities of Vision Language Models (VLMs) on video-based tasks, with a focus on Italian culture and language, revealing their fragility in consistency and visually grounded language comprehension and generation.

Methods: MAIA comprises a set of video-related questions tested with two tasks: visual statement verification and open-ended visual question answering, categorized into twelve reasoning types to disentangle language-vision relations.

Key Findings: The ability of Vision Language Models (VLMs) to perform consistent, visually grounded natural language understanding and generation across fine-grained reasoning categories.

DOI: https://doi.org/10.48550/arXiv.2502.16989
Improving Human-AI Coordination through Adversarial Training and Generative Models

Authors: Paresh Chaudhary, Yancheng Liang, Daphne Chen, Simon S. Du, Natasha Jaques

Year: 2025

Published in: ArXiv

Institution: University of Washington

Research Area: Human-AI Coordination, Zero-Shot Coordination, Adversarial Training, Generative Models

Discipline: Artificial Intelligence, Human-Computer Interaction

The paper introduces GOAT, a novel framework combining pretrained generative models and adversarial training to improve human-AI coordination, achieving state-of-the-art performance on the Overcooked benchmark with real human partners.

Methods: The study utilized a frozen pretrained generative model to simulate cooperative agent policies and applied adversarial training to dynamically generate challenging human-AI interaction scenarios for training.

Key Findings: The effectiveness of GOAT in generalizing human-AI coordination strategies and its performance on the Overcooked benchmark.
Multimodal large language models can make context-sensitive hate speech evaluations aligned with human judgement

Authors: T Davidson

Year: 2025

Published in: Nature Human Behaviour, 2025 - nature.com

Institution: University of Oxford, Davidson College

Research Area: Hate Speech Evaluation, Multimodal LLMs, Social Bias, Computational Law, AI Bias, AI Evaluation

Discipline: Artificial Intelligence

The study demonstrates that larger multimodal large language models (MLLMs) can align closely with human judgement in context-sensitive hate speech evaluations, though they still exhibit biases and limitations.

Methods: Conjoint experiments where simulated social media posts varying in attributes like slur usage and user demographics were evaluated by MLLMs and compared to human judgements.

Key Findings: The capacity of MLLMs to evaluate hate speech in a context-sensitive manner and their alignment with human judgement, while assessing biases and responsiveness to contextual cues.

Sample Size: 1854
ImagenHub: Standardizing the evaluation of conditional image generation models

Authors: M Ku, T Li, K Zhang, Y Lu, X Fu, W Zhuang

Year: 2024

Published in: arXiv preprint arXiv …, 2023 - arxiv.org

Institution: University of Waterloo, Ohio State University, University of California Santa Barbara, University of Pennsylvania

Research Area: AI alignment, Representation learning, Cognitive computational modeling, Vision foundation models evaluation, Multimodal, Vision models

Discipline: Computer Science, Artificial Intelligence, Machine Learning

This paper presents a method for aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction.

DOI: https://doi.org/10.48550/arXiv.2310.01596

Citations: 59
Annotator in the Loop: A Case Study of In-Depth Rater Engagement to Create a Prosocial Benchmark Dataset

Authors: S Schmer-Galunder, R Wheelock, Z Jalan

Year: 2024

Published in: Proceedings of the ..., 2024 - ojs.aaai.org

Institution: Google DeepMind, Google, Accenture, Amazon

Research Area: AI Ethics and Prosocial Data Annotation

Discipline: Artificial Intelligence, Ethics, Behavioral Science

DOI: https://doi.org/10.1609/aies.v7i1.31726

Citations: 3
Are Large Language Models More Empathetic than Humans?

Authors: A Welivita, P Pu

Year: 2024

Published in: ArXiv

Institution: École Polytechnique Fédérale de Lausanne

Research Area: Large Language Models, Empathy, Human-AI Interaction

Discipline: Artificial Intelligence, Human-Computer Interaction, Social Science
Benchmarking Distributional Alignment of Large Language Models

Authors: N Meister

Year: 2024

Published in: ArXiv

Institution: Stanford University

Research Area: Distributional Alignment of LLMs, LLM Benchmarking, AI Robustness, AI Fairness, AI Bias

Discipline: Artificial Intelligence
Can Large Language Models Understand Symbolic Graphics Programs?

Authors: Z Qiu, W Liu, H Feng, Z Liu, T Xiao

Year: 2024

Published in: ArXiv

Institution: Massachusetts Institute of Technology, Max Planck Institute, University of Cambridge

Research Area: Computational cognition, LLM evaluation, Program synthesis, Multimodal reasoning

Discipline: Artificial Intelligence

Introduces SGP-Bench, a benchmark testing whether LLMs can answer semantic and spatial questions about images purely from graphics programs (SVG/CAD), effectively probing “visual imagination without vision.” The authors show current LLMs struggle - sometimes near chance - even when images are trivial for humans, but demonstrate that Symbolic Instruction Tuning (SIT) can meaningfully improve thi...
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier

Year: 2024

Published in: ArXiv

Institution: NAVER Labs

Research Area: Long-Context Language Models, Meeting Assistant Systems, Benchmark Evaluation

Discipline: Artificial Intelligence
Image-conditioned human language comprehension and psychometric benchmarking of visual language models

Authors: SN Pushpita, R Levy

Year: 2024

Published in: Proceedings of the 28th Conference on ..., 2024 - aclanthology.org

Institution: Massachusetts Institute of Technology

Research Area: Visual Language Models (VLMs), Psycholinguistics, Psychometric Benchmarking

Discipline: Artificial Intelligence

DOI: https://doi.org/10.18653/v1/2024.conll-1.34
Improved Distribution Matching Distillation for Fast Image Synthesis

Authors: Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman

Year: 2024

Published in: ArXiv

Institution: Adobe Research, Massachusetts Institute of Technology

Research Area: Computer Vision, Image Synthesis, Diffusion Models

Discipline: Artificial Intelligence
MAIA: A benchmark for multimodal AI assessment

Authors: D Testa, G Bonetta, R Bernardi

Year: 2024

Published in: Proceedings of the ..., 2025 - aclanthology.org

Institution: Università di Roma La Sapienza, Fondazione Bruno Kessler, University of Pisa

Research Area: Multimodal AI Assessment, Visual Language Models (VLMs), Video Understanding, Computational Linguistics

Discipline: Artificial Intelligence, Computational Linguistics
SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation

Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine

Year: 2024

Published in: ArXiv

Institution: Allen Institute for AI, Duke University, University of California Berkeley, University of Washington

Research Area: LLM Safety Moderation, Explainable AI (XAI), LLM Alignment, Steerable AI

Discipline: Artificial Intelligence