This page lists 14 peer-reviewed papers classified as Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23,404
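The hierarchical Bayesian Bradley–Terry–Davidson model used in this study builds on the classic Bradley–Terry pairwise-comparison model. As an illustrative sketch only (plain Bradley–Terry fit by the standard MM algorithm, with no tie handling, hierarchy, or post-stratification, and with fabricated win counts), ranking models from pairwise preferences looks like this:

```python
def bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths with the classic MM algorithm.

    wins[(i, j)] = number of times item i was preferred over item j.
    Returns strengths normalized to sum to n_items.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            # total wins for item i
            w_i = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x * n_items / s for x in new_p]  # renormalize each sweep
    return p

# fabricated toy data: model A beats B 8/10, B beats C 7/10, A beats C 9/10
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
strengths = bradley_terry(wins, 3)
print(strengths)  # strengths should rank A > B > C
```

The full HUMAINE framework layers a Davidson tie parameter, hierarchical demographic effects, and census post-stratification on top of this basic comparison model; none of that is reproduced here.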
-
Authors: LM Schulze Buschoff, E Akata, M Bethge
Year: 2025
Published in: Nature Machine Intelligence, 2025
Institution: Max Planck Institute
Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)
Discipline: Cognitive Science, Artificial Intelligence, Computer Vision
Vision-based large language models show proficiency in visual data interpretation but fall short in human-like abilities for causal reasoning, intuitive physics, and social cognition.
Methods: Controlled experiments evaluating model performance on tasks related to intuitive physics, causal reasoning, and intuitive psychology using visual processing benchmarks.
Key Findings: Model capabilities in understanding physical interactions, causal relationships, and social preferences.
DOI: https://doi.org/10.1038/s42256-024-00963-y
Citations: 70
-
Authors: T Zhang, A Koutsoumpis, JK Oostrom
Year: 2025
Published in: IEEE Transactions ..., 2024
Institution: Southeast University, Vrije Universiteit, Tilburg University
Research Area: LLM Personality Assessment, Human-AI Interaction, LLM
Discipline: Human-AI Interaction, Social Science, Humanities
LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.
Methods: The study evaluated GPT-3.5 and GPT-4 performance in assessing personality traits and interview performance using simulated AVI responses, comparing them with ratings from task-specific AI and human annotators.
Key Findings: Validity, reliability, fairness, and rating patterns of LLMs (GPT-3.5 and GPT-4) in personality assessment from asynchronous video interviews.
Citations: 31
Sample Size: 685
-
Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger
Year: 2025
Published in: arXiv preprint arXiv:2502.07077, 2025
Institution: Google DeepMind, Google, University of Oxford
Research Area: Multimodal conversational AI, conversational AI, Evaluation methodology, benchmarking
Discipline: Computer Science, Natural Language Processing (NLP), Human–Computer Interaction (HCI)
The paper evaluates anthropomorphic behaviors in state-of-the-art LLMs through a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.
Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.
Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.
Citations: 26
Sample Size: 1,101
-
Authors: K Zhou, JD Hwang, X Ren, N Dziri
Year: 2025
Published in: Proceedings of the ..., 2025
Institution: Stanford University, University of Southern California, Carnegie Mellon University, Allen Institute for AI
Research Area: Human-LM Reliance, Interaction-Centered Framework, Human-Computer Interaction (HCI)
Discipline: Human-Computer Interaction (HCI), Artificial Intelligence
The study introduces Rel-A.I., an interaction-centered evaluation approach to measure human reliance on LLM responses, revealing that politeness and interaction context significantly influence user reliance.
Methods: Nine user studies were conducted, analyzing user reliance influenced by LLM communication features such as politeness and context through participant interaction experiments.
Key Findings: The degree of human reliance on LLM responses based on communication style (e.g., politeness) and interaction context (e.g., knowledge domain, prior interactions).
Citations: 18
Sample Size: 450
-
Authors: P Thwaites, N Vandeweerd, M Paquot
Year: 2025
Published in: Applied Linguistics, 2025 - academic.oup.com
Institution: UCLouvain, Radboud University Nijmegen, Fonds de la Recherche Scientifique – FNRS
Research Area: Applied Linguistics, Educational Assessment, Crowdsourcing
Discipline: Applied Linguistics
The study demonstrates that crowdsourcing platforms can recruit judges to evaluate learner texts with reliability and validity comparable to assessments conducted by trained linguists.
Methods: Judges recruited via an online crowdsourcing platform conducted comparative judgement assessments of learner texts to measure writing proficiency.
Key Findings: Reliability and concurrent validity of learner text evaluations performed via crowdsourced judges compared to linguist evaluations.
Citations: 10
-
Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard
Year: 2025
Published in: ArXiv
Institution: Aleph Alpha, Massachusetts Institute of Technology
Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming
Discipline: Natural Language Processing
The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.
Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.
Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.
Citations: 2
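The linear-probe result in this entry can be illustrated with a toy stand-in. The sketch below fabricates "hidden states" in which ambiguous examples are shifted along one fixed direction, then trains a logistic-regression probe by plain gradient descent; the paper probes real model representations, so everything here (dimensions, shift, learning rate) is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32

# Fabricated stand-in for hidden states: ambiguous sentences' representations
# are assumed to be shifted along a fixed direction in activation space.
direction = rng.normal(size=dim)
X = np.vstack([rng.normal(size=(300, dim)),              # unambiguous
               rng.normal(size=(300, dim)) + direction])  # ambiguous
y = np.concatenate([np.zeros(300), np.ones(300)])

# Linear probe = logistic regression trained by gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)      # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)  # gradient of the log loss
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

Because the two classes here are linearly separable by construction, the probe recovers the ambiguity direction easily; the paper's point is that the same linear decodability holds in real model representations even when direct prompting fails.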
-
Authors: D O'Connell, A Bautista
Year: 2025
Published in: ... Student Journal of ..., 2025
Institution: University of Houston, Webster University
Research Area: Crowdsourcing Research Methodology, Human-Computer Interaction (HCI)
Discipline: Computational Social Science, Behavioral Research
Prolific outperforms MTurk in participant data quality and affordability for online survey-based research.
Methods: Data from participants recruited via MTurk and Prolific were analyzed for cost, attention measures, participation duration, and internal consistency.
Key Findings: Comparison of data quality and cost-effectiveness between MTurk and Prolific for online survey recruitment.
Citations: 1
Sample Size: 699
-
Authors: DT Esch, N Mylonopoulos, V Theoharakis
Year: 2025
Published in: Behavior Research Methods, 2025 - Springer
Institution: University of Cologne, University of Piraeus, Aristotle University of Thessaloniki
Research Area: Crowdsourcing Behavioral Research, Mobile Data Collection
Discipline: Behavioral Research
Mobile-based responses via platforms like Pollfish are comparable in quality to computer-based ones from MTurk and Prolific, though attentiveness varies significantly across platforms and is influenced by incentives, distractions, and System 1 thinking.
Methods: Conducted two studies distributing the same survey across MTurk, Prolific, Pollfish, and Qualtrics panels to compare data quality and analyze attentiveness scores.
Key Findings: Attentiveness, device usage (mobile vs. computer), and factors influencing data quality such as incentives, respondent activity, distractions, and survey familiarity.
Citations: 1
-
Authors: P Schmidtová, O Dušek, S Mahamood
Year: 2025
Published in: ArXiv
Institution: Charles University, Trivago
Research Area: Summarization evaluation, Natural Language Processing, LLM-as-a-Judge, AI Evaluation
Discipline: Natural Language Processing
Simpler metrics like word overlap surprisingly correlate well with human judgments in summarization evaluation, outperforming complex methods in out-of-domain applications, though LLMs remain unreliable for assessment due to annotation biases.
Methods: Human evaluation campaigns with categorical error assessment, span-level annotations, and comparison of traditional metrics, trainable models, and LLM-as-a-judge approaches.
Key Findings: Effectiveness of summarization evaluation methods and their correlation with human judgment, along with business impacts of incorrect information in generated summaries.
Citations: 1
-
Authors: PW Mirowski, J Love, K Mathewson, S Mohamed
Year: 2024
Published in: ArXiv
Institution: Google DeepMind, Google
Research Area: AI Creativity, Humor Generation, Human-Computer Interaction (HCI)
Discipline: Artificial Intelligence
Professional comedians found LLMs insufficient as creativity support tools for comedy, citing bias, bland output, and reinforcement of hegemonic viewpoints.
Methods: Workshops conducted with professional comedians combining comedy writing sessions using LLMs, a Creativity Support Index questionnaire, and focus groups discussing their experiences and ethical concerns.
Key Findings: Effectiveness of LLMs as creativity support tools for comedy writing, ethical concerns (bias, censorship, copyright), and value alignment in AI outputs.
Citations: 52
Sample Size: 20
-
Authors: AYJ Ha, J Passananti, R Bhaskar, S Shan
Year: 2024
Published in: Proceedings of the ..., 2024
Institution: University of California Santa Barbara, The University of Chicago, Institute of Education, University College London
Research Area: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
Discipline: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
The paper compares human and automated detectors at distinguishing human art from AI-generated images, finding that combining the two offers the best performance despite persistent weaknesses.
Methods: Comparison of human art across 7 styles with AI-generated images from 5 generative models, assessed using 5 automated detectors and 3 human groups (crowdworkers, professional artists, expert artists).
Key Findings: Detection accuracy and robustness of human and automated methods in identifying AI-generated images under benign and adversarial conditions.
DOI: https://doi.org/10.1145/3658644.3670306
Citations: 52
Sample Size: 3,993
-
Authors: T Eloundou, A Beutel, DG Robinson
Year: 2024
Published in: arXiv preprint arXiv:2410.19803, 2024
Institution: OpenAI, Google DeepMind, Google, University of Oxford
Research Area: Fairness in LLM, AI Bias, AI Ethics
Discipline: Artificial Intelligence, Social Science
The paper introduces a counterfactual approach to evaluate 'first-person fairness' in chatbots, demonstrating that reinforcement learning can mitigate biases based on demographics across extensive chatbot interactions.
Methods: The study uses a Language Model as a Research Assistant (LMRA) to quantitatively and qualitatively assess biases based on demographics across millions of chatbot interactions, covering 66 tasks in 9 domains and involving two genders and four races. Bias evaluations are corroborated by independent...
Key Findings: Demographic biases in chatbot responses, including harmful stereotypes and response differences by gender and race, across diverse tasks and domains.
DOI: https://doi.org/10.48550/arXiv.2410.19803
Citations: 33
Sample Size: 6,000,000
-
Authors: K Warren, T Tucker, A Crowder, D Olszewski
Year: 2024
Published in: Proceedings of the ..., 2024
Institution: University of Florida
Research Area: Audio Deepfake Detection, Human Factors in AI Security, Perceptual Studies, AI Security
Discipline: Computer Science
Humans outperform machine learning models in classifying real human audio versus deepfakes, but are often misled by preconceptions about generated content, highlighting the need for more synergistic approaches between human and machine decision-making.
Methods: A large-scale user study was conducted where over 1,200 participants evaluated audio samples from three widely-cited deepfake datasets. Performance was quantitatively measured and thematic analysis was used to explore user reasoning and differences from machine classification.
Key Findings: Comparison of human and machine classification performance on audio deepfake detection, analysis of user reasoning, and evaluation of error patterns between both humans and models.
DOI: https://doi.org/10.1145/3658644.3670325
Citations: 14
Sample Size: 1,200