Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger
Year: 2025
Published in: arXiv preprint arXiv:2502.07077, 2025
Institution: Google DeepMind, Google, University of Oxford
Research Area: Conversational AI, evaluation methodology, benchmarking
Discipline: Computer Science, Natural Language Processing (NLP), Human–Computer Interaction (HCI)
The paper evaluates anthropomorphic behaviors in state-of-the-art LLMs through a multi-turn methodology, showing that such behaviors, including expressions of empathy and relationship-building, predominantly emerge over multiple interactions and influence user perceptions.
Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.
Key Findings: Anthropomorphic behaviors in large language models, such as relationship-building and pronoun usage, emerge predominantly over multi-turn interactions and shape how users perceive the model.
Citations: 26
Sample Size: 1101
Authors: D Testa, G Bonetta, R Bernardi, A Bondielli
Year: 2025
Published in: arXiv preprint arXiv:2502.16989, 2025
Institution: Università di Roma La Sapienza
Research Area: Multimodal Reasoning, AI Benchmarking
Discipline: Artificial Intelligence
MAIA is a benchmark designed to evaluate the reasoning abilities of Vision Language Models (VLMs) on video-based tasks, with a focus on Italian culture and language, revealing the models' fragility in maintaining consistency and in visually grounded language comprehension and generation.
Methods: MAIA comprises a set of video-grounded questions assessed through two tasks, visual statement verification and open-ended visual question answering, categorized into twelve reasoning types to disentangle language-vision relations.
Key Findings: VLMs struggle to perform consistent, visually grounded natural language understanding and generation across fine-grained reasoning categories.
DOI: https://doi.org/10.48550/arXiv.2502.16989