Evaluation: Research Area — Prolific Citations Library

Discover 32 peer-reviewed studies in Evaluation (2024–2026). Explore research findings powered by Prolific's diverse participant panel.

This page lists 32 peer-reviewed papers in the research area of Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (20 of 32)

Bayesian teaching enables probabilistic reasoning in large language models

Authors: L Qiu, F Sha, K Allen, Y Kim, T Linzen, S van Steenkiste

Year: 2026

Published in: Nature …, 2026 - nature.com

Institution: Meta, Google DeepMind, Massachusetts Institute of Technology, Google Research, Google

Research Area: Probabilistic reasoning, Bayesian cognition, Neural language models, Reasoning, AI Evaluations

Discipline: Machine learning, Artificial intelligence

This paper sits at the intersection of machine learning and computational cognitive science, showing that large language models can acquire generalized probabilistic reasoning by being trained to imitate Bayesian belief updating rather than relying on prompting or heuristics.

Citations: 8
The Artificial Intelligence Disclosure Penalty: Humans Persistently Devalue AI-Generated Creative Writing

Authors: M Raj, JM Berg, R Seamans

Year: 2026

Published in: Journal of Experimental Psychology …, 2026 - psycnet.apa.org

Institution: New York University, University of Michigan, Wharton

Research Area: Disclosure psychology, Biases in human–machine evaluation, AI Biases

Discipline: Experimental psychology

This paper sits at the intersection of experimental psychology, social cognition, and consumer judgment, examining how AI disclosure triggers persistent authenticity-based bias against creative work, revealing a robust form of algorithmic aversion in symbolic and expressive domains.

DOI: https://doi.org/10.1037/xge0001889
Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Authors: N Petrova, A Gordon, E Blindow

Year: 2026

Published in: Open review

Institution: Prolific

Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation

Discipline: Machine Learning, Artificial Intelligence

The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.

Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.

Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.

Sample Size: 23404
Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger

Year: 2025

Published in: arXiv preprint arXiv:2502.07077, 2025•arxiv.org

Institution: Google DeepMind, Google, University of Oxford

Research Area: Multimodal conversational AI, conversational AI, Evaluation methodology, benchmarking

Discipline: Computer Science, Natural Language Processing (NLP), Human–Computer Interaction (HCI)

The paper evaluates anthropomorphic behaviors in SOTA LLMs through a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.

Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.

Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.

Citations: 26

Sample Size: 1101
The Viability of Crowdsourcing for RAG Evaluation

Authors: L Gienapp, T Hagen, M Fröbe, M Hagen, B Stein, M Potthast, H Scells

Year: 2025

Published in: ArXiv

Institution: Bauhaus-Universitat Weimar, Friedrich-Schiller-Universitat Jena, Leipzig University, University of Kassel, ScaDS.AI, hessian.AI

Research Area: Crowdsourcing, RAG Evaluation, Artificial Intelligence, AI Evaluation, RAG

Discipline: Artificial Intelligence

The study investigates the feasibility of using crowdsourcing for RAG evaluation, finding that human pairwise judgments are reliable and cost-effective compared to LLM-based or automated methods.

Methods: Two complementary studies on response writing and response utility judgment using 903 human-written and 903 LLM-generated responses for 301 topics; pairwise judgments across seven utility dimensions were collected via human and LLM evaluators.

Key Findings: Human effectiveness in writing and judging responses in RAG scenarios, considering discourse styles and utility dimensions like coverage and coherence.

Citations: 4

Sample Size: 903
Conversational AI increases political knowledge as effectively as self-directed internet search

Authors: L Luettgau, HR Kirk, K Hackenburg, J Bergs, H Davidson, H Ogden, D Siddarth, S Huang

Year: 2025

Published in: ARXIV

Institution: AI Security Institute, I Policy Directorate, Collective Intelligence Project, Anthropic

Research Area: Experimental evaluation, RCT, Survey Research

Discipline: Computer Science, Human–Computer Interaction (HCI)

Conversational AI is as effective as self-directed internet searches in increasing political knowledge, reducing misinformation beliefs, and promoting accuracy among users in the UK during the 2024 election period.

Methods: A national survey (N=2,499) measured conversational AI usage for political information-seeking, followed by a series of randomised controlled trials (N=2,858) comparing conversational AI to self-directed internet search in improving political knowledge.

Key Findings: Extent of conversational AI usage for political knowledge-seeking in the UK and its efficacy in enhancing political knowledge and reducing misinformation compared to traditional internet searches.

Citations: 3

Sample Size: 5357
Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard

Year: 2025

Published in: ArXiv

Institution: Aleph Alpha, Massachusetts Institute of Technology

Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming

Discipline: Natural Language Processing

The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.

Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.

Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.

Citations: 2
Benchmarking World-Model Learning

Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares

Year: 2025

Published in: 2025 - arXiv preprint arXiv …, 2025 - arxiv.org

Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University

Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, RLHF, LLM

Discipline: Computer Science, Artificial Intelligence, Machine Learning

The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.

Methods: The authors proposed WorldTest, a protocol separating reward-free interaction from scored tests in related environments, with evaluations done using AutumnBench—a dataset of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics.

Key Findings: Performance of model-learning agents and humans in acquiring world models for masked-frame prediction, planning, and understanding causal dynamics.

Citations: 1

Sample Size: 517
Real-World Summarization: When Evaluation Reaches Its Limits

Authors: P Schmidtová, O Dušek, S Mahamood

Year: 2025

Published in: ArXiv

Institution: Charles University, Trivago

Research Area: Summarization evaluation, Natural Language Processing, LLM-as-a-Judge, AI Evaluation

Discipline: Natural Language Processing

Simpler metrics like word overlap surprisingly correlate well with human judgments in summarization evaluation, outperforming complex methods in out-of-domain applications, though LLMs remain unreliable for assessment due to annotation biases.

Methods: Human evaluation campaigns with categorical error assessment, span-level annotations, and comparison of traditional metrics, trainable models, and LLM-as-a-judge approaches.

Key Findings: Effectiveness of summarization evaluation methods and their correlation with human judgment, along with business impacts of incorrect information in generated summaries.

Citations: 1
Who's Sorry Now: User Preferences Among Rote, Empathic, and Explanatory Apologies from LLM Chatbots

Authors: Z Ashktorab, A Buccella, J D'Cruz, Z Fowler, A Gill, KY Leung, PD Magnus, J Richards

Year: 2025

Published in: arXiv preprint arXiv:2507.02745, 2025•arxiv.org

Institution: IBM Research, University at Albany

Research Area: Human–AI interaction, AI systems evaluation, UX, User Experience

Discipline: Computer Science, Human–Computer Interaction (HCI)

In a preregistered study with 162 participants, people generally prefer explanatory apologies from LLM chatbots over rote or purely empathic ones—though in biased error scenarios empathic apologies are sometimes favored—highlighting the complexity of designing chatbot apologies that effectively repair trust.

DOI: https://doi.org/10.48550/arXiv.2507.02745

Citations: 1
Whose view of safety? a deep dive dataset for pluralistic alignment of text-to-image models

Authors: C Rastogi, TH Teh, P Mishra, R Patel, D Wang, M Díaz, A Parrish, AM Davani, Z Ashwood

Year: 2025

Published in: arXiv preprint arXiv:2507.13383, 2025•arxiv.org

Institution: Google DeepMind, Google Research, Google

Research Area: AI alignment, safety evaluation, AI Safety, Multimodal evaluation, Human–AI interaction, LLM

Discipline: Computer Science, Machine Learning, Artificial Intelligence

This research introduces the DIVE dataset to enable pluralistic alignment in text-to-image models by accounting for diverse safety perspectives, revealing demographic variations in harm perception and advancing T2I model alignment strategies.

Methods: The study involved collecting feedback across 1000 prompts from demographically intersectional human raters to capture diverse safety perspectives, with an emphasis on empirical and contextual differences in harm perception.

Key Findings: Safety perceptions of text-to-image (T2I) model outputs from diverse demographic viewpoints and the influence of these perspectives on alignment strategies.

Citations: 1

Sample Size: 1000
10 Questions to Fall in Love with ChatGPT: An Experimental Study on Interpersonal Closeness with Large Language Models (LLMs)

Authors: J Szczuka, L Mühl, P Ebner, S Dubé

Year: 2025

Published in: ArXiv

Institution: University of Duisburg-Essen

Research Area: Human-Computer Interaction (HCI), Social Psychology, Interpersonal Relationships with AI, LLM Evaluation

Discipline: Social Science

Participants rated AI-generated dating profile responses equally as human-like in terms of closeness and romantic interest, challenging assumptions about authenticity in online communication.

Methods: Participants evaluated 10 AI-generated responses to an interpersonal closeness task in a matchmaking scenario, without knowing the responses were AI-generated.

Key Findings: Impact of perceived response source (human vs AI) on interpersonal closeness and romantic interest; influence of perceived quality and human-likeness.

Sample Size: 307
Large Language Models Pass the Turing Test

Authors: Cameron R. Jones, Benjamin K. Bergen

Year: 2025

Published in: ArXiv

Institution: University of California San Diego

Research Area: Artificial Intelligence, Computational Linguistics, Turing Test, AI Evaluation

Discipline: Artificial Intelligence

GPT-4.5 passed the Turing Test by being misidentified as human 73% of the time, surpassing real humans and other models, marking the first conclusive evidence of an AI achieving this standard.

Methods: Randomised, controlled, pre-registered Turing Test where 5-minute conversations were conducted between human participants and AI systems, followed by judgments on which partner was human.

Key Findings: The ability of AI systems (ELIZA, GPT-4o, LLaMa-3.1-405B, GPT-4.5) to mimic human conversational behavior and be perceived as human.
Multimodal large language models can make context-sensitive hate speech evaluations aligned with human judgement

Authors: T Davidson

Year: 2025

Published in: Nature Human Behaviour, 2025 - nature.com

Institution: University of Oxford, Davidson College

Research Area: Hate Speech Evaluation, Multimodal LLMs, Social Bias, Computational Law, AI Bias, AI Evaluation

Discipline: Artificial Intelligence

The study demonstrates that larger multimodal large language models (MLLMs) can align closely with human judgement in context-sensitive hate speech evaluations, though they still exhibit biases and limitations.

Methods: Conjoint experiments where simulated social media posts varying in attributes like slur usage and user demographics were evaluated by MLLMs and compared to human judgements.

Key Findings: The capacity of MLLMs to evaluate hate speech in a context-sensitive manner and their alignment with human judgement, while assessing biases and responsiveness to contextual cues.

Sample Size: 1854
ImagenHub: Standardizing the evaluation of conditional image generation models

Authors: M Ku, T Li, K Zhang, Y Lu, X Fu, W Zhuang

Year: 2024

Published in: - arXiv preprint arXiv …, 2023 - arxiv.org

Institution: University of Waterloo, Ohio State University, University of California Santa Barbara, University of Pensylvania

Research Area: AI alignment, Representation learning, Cognitive computational modeling, Vision foundation models evaluation, Multimodal, Vision models

Discipline: Computer Science, Artificial Intelligence, Machine Learning

This paper presents a method for **aligning machine vision model representations with human visual similarity judgments across different abstraction levels, improving how well models reflect human perceptual and conceptual organization and enhancing generalization and uncertainty prediction.

DOI: https://doi.org/10.48550/arXiv.2310.01596

Citations: 59
Revealing complexities when adult readers engage in the credibility evaluation of social media posts

Authors: M Kuutila, C Kiili, R Kupiainen, E Huusko, J Li

Year: 2024

Published in: Computers in Human ..., 2024 - Elsevier

Research Area: Social Media Credibility Evaluation, Human-Computer Interaction (HCI), Cyberpsychology, AI Evaluation

Discipline: Computer science, human–computer interaction, cyberpsychology

The study found that prior belief consistency and source expertise significantly influenced perceived credibility of health-related social media posts, while evidence quality had minimal impact. Crowdsourcing platform choice also affected credibility evaluations of inaccurate posts.

Methods: Researchers created social media posts with manipulated source characteristics, claim accuracy, and evidence quality. Participants evaluated the credibility of these posts via crowdsourcing platforms after having their prior topic beliefs assessed.

Key Findings: The perceived credibility of health-related social media posts based on source characteristics, evidence quality, prior beliefs, and the platform used for data collection.

DOI: https://doi.org/10.1016/j.chb.2023.108017

Citations: 19

Sample Size: 844
Planning for New Threats to Online Research Data Validity: The Issue of Computer-Using Agents

Authors: J Agley

Year: 2024

Published in: Evaluation & the Health Professions, 2025 - journals.sagepub.com

Institution: Indiana University, Prevention Insights

Research Area: Health Research and Evaluation, Data Validity, Computational Social Science

Discipline: Public Health, Computational Social Science

Citations: 2
Can Large Language Models Understand Symbolic Graphics Programs?

Authors: Z Qiu, W Liu, H Feng, Z Liu, T Xiao

Year: 2024

Published in: ArXiv

Institution: Massachusetts Institute of Technology, Max Planck Institute, University of Cambridge

Research Area: Computational cognition, LLM evaluation, Program synthesis, Multimodal reasoning

Discipline: Artificial Intelligence

Introduces SGP-Bench, a benchmark testing whether LLMs can answer semantic and spatial questions about images purely from graphics programs (SVG/CAD), effectively probing “visual imagination without vision.” The authors show current LLMs struggle - sometimes near chance - even when images are trivial for humans, but demonstrate that Symbolic Instruction Tuning (SIT) can meaningfully improve thi...
Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models

Authors: Daria Kryvosheieva

Year: 2024

Published in: ArXiv

Institution: Massachusetts Institute of Technology

Research Area: Natural Language Processing, AI Evaluation

Discipline: Natural Language Processing
ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

Authors: Thibaut Thonet, Jos Rozen, Laurent Besacier

Year: 2024

Published in: ArXiv

Institution: NAVER Labs

Research Area: Long-Context Language Models, Meeting Assistant Systems, Benchmark Evaluation

Discipline: Artificial Intelligence