This page lists 14 peer-reviewed papers classified as Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23,404
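The hierarchical Bayesian Bradley–Terry–Davidson model used in this study builds on the classic Bradley–Terry pairwise-comparison model. As an illustrative sketch only (plain Bradley–Terry fit by the standard MM algorithm, with no tie handling, hierarchy, or post-stratification, and with fabricated win counts), ranking models from pairwise preferences looks like this:

```python
def bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths with the classic MM algorithm.

    wins[(i, j)] = number of times item i was preferred over item j.
    Returns strengths normalized to sum to n_items.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            # total wins for item i
            w_i = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x * n_items / s for x in new_p]  # renormalize each sweep
    return p

# fabricated toy data: model A beats B 8/10, B beats C 7/10, A beats C 9/10
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
strengths = bradley_terry(wins, 3)
print(strengths)  # strengths should rank A > B > C
```

The full HUMAINE framework layers a Davidson tie parameter, hierarchical demographic effects, and census post-stratification on top of this basic comparison model; none of that is reproduced here.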
-
Authors: LM Schulze Buschoff, E Akata, M Bethge
Year: 2025
Published in: Nature Machine Intelligence, 2025
Institution: Max Planck Institute
Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)
Discipline: Cognitive Science, Artificial Intelligence, Computer Vision
Vision-based large language models show proficiency in visual data interpretation but fall short in human-like abilities for causal reasoning, intuitive physics, and social cognition.
Methods: Controlled experiments evaluating model performance on tasks related to intuitive physics, causal reasoning, and intuitive psychology using visual processing benchmarks.
Key Findings: Model capabilities in understanding physical interactions, causal relationships, and social preferences.
DOI: https://doi.org/10.1038/s42256-024-00963-y
Citations: 70
-
Authors: T Zhang, A Koutsoumpis, JK Oostrom
Year: 2025
Published in: IEEE Transactions ..., 2024
Institution: Southeast University, Vrije Universiteit, Tilburg University
Research Area: LLM Personality Assessment, Human-AI Interaction, LLM
Discipline: Human-AI Interaction, Social Science, Humanities
LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.
Methods: The study evaluated GPT-3.5 and GPT-4 performance in assessing personality traits and interview performance using simulated AVI responses, comparing them with ratings from task-specific AI and human annotators.
Key Findings: Validity, reliability, fairness, and rating patterns of LLMs (GPT-3.5 and GPT-4) in personality assessment from asynchronous video interviews.
Citations: 31
Sample Size: 685
-
Authors: L Ibrahim, C Akbulut, R Elasmar, C Rastogi, M Kahng, MR Morris, KR McKee, V Rieser, M Shanahan, L Weidinger
Year: 2025
Published in: arXiv preprint arXiv:2502.07077, 2025
Institution: Google DeepMind, Google, University of Oxford
Research Area: Multimodal conversational AI, conversational AI, Evaluation methodology, benchmarking
Discipline: Computer Science, Natural Language Processing (NLP), Human–Computer Interaction (HCI)
The paper evaluates anthropomorphic behaviors in state-of-the-art LLMs through a multi-turn methodology, showing that such behaviors, including empathy and relationship-building, predominantly emerge after multiple interactions and influence user perceptions.
Methods: Multi-turn evaluation of 14 anthropomorphic behaviors using simulations of user interactions, validated by a large-scale human subject study.
Key Findings: Anthropomorphic behaviors in large language models, including relationship-building and pronoun usage, and their perception by users.
Citations: 26
Sample Size: 1,101
-
Authors: K Zhou, JD Hwang, X Ren, N Dziri
Year: 2025
Published in: Proceedings of the ..., 2025
Institution: Stanford University, University of Southern California, Carnegie Mellon University, Allen Institute for AI
Research Area: Human-LM Reliance, Interaction-Centered Framework, Human-Computer Interaction (HCI)
Discipline: Human-Computer Interaction (HCI), Artificial Intelligence
The study introduces Rel-A.I., an interaction-centered evaluation approach to measure human reliance on LLM responses, revealing that politeness and interaction context significantly influence user reliance.
Methods: Nine user studies were conducted, analyzing user reliance influenced by LLM communication features such as politeness and context through participant interaction experiments.
Key Findings: The degree of human reliance on LLM responses based on communication style (e.g., politeness) and interaction context (e.g., knowledge domain, prior interactions).
Citations: 18
Sample Size: 450
-
Authors: P Thwaites, N Vandeweerd, M Paquot
Year: 2025
Published in: Applied Linguistics, 2025 - academic.oup.com
Institution: UCLouvain, Radboud University Nijmegen, Fonds de la Recherche Scientifique – FNRS
Research Area: Applied Linguistics, Educational Assessment, Crowdsourcing
Discipline: Applied Linguistics
The study demonstrates that crowdsourcing platforms can recruit judges to evaluate learner texts with reliability and validity comparable to assessments conducted by trained linguists.
Methods: Judges recruited via an online crowdsourcing platform conducted comparative judgement assessments of learner texts to measure writing proficiency.
Key Findings: Reliability and concurrent validity of learner text evaluations performed via crowdsourced judges compared to linguist evaluations.
Citations: 10
-
Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard
Year: 2025
Published in: ArXiv
Institution: Aleph Alpha, Massachusetts Institute of Technology
Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming
Discipline: Natural Language Processing
The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.
Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.
Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.
Citations: 2
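The linear-probe result in this entry can be illustrated with a toy stand-in. The sketch below fabricates "hidden states" in which ambiguous examples are shifted along one fixed direction, then trains a logistic-regression probe by plain gradient descent; the paper probes real model representations, so everything here (dimensions, shift, learning rate) is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32

# Fabricated stand-in for hidden states: ambiguous sentences' representations
# are assumed to be shifted along a fixed direction in activation space.
direction = rng.normal(size=dim)
X = np.vstack([rng.normal(size=(300, dim)),              # unambiguous
               rng.normal(size=(300, dim)) + direction])  # ambiguous
y = np.concatenate([np.zeros(300), np.ones(300)])

# Linear probe = logistic regression trained by gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)      # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)  # gradient of the log loss
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

Because the two classes here are linearly separable by construction, the probe recovers the ambiguity direction easily; the paper's point is that the same linear decodability holds in real model representations even when direct prompting fails.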
-
Authors: D O'Connell, A Bautista
Year: 2025
Published in: ... Student Journal of ..., 2025
Institution: University of Houston, Webster University
Research Area: Crowdsourcing Research Methodology, Human-Computer Interaction (HCI)
Discipline: Computational Social Science, Behavioral Research
Prolific outperforms MTurk in participant data quality and affordability for online survey-based research.
Methods: Data from participants recruited via MTurk and Prolific were analyzed for cost, attention measures, participation duration, and internal consistency.
Key Findings: Comparison of data quality and cost-effectiveness between MTurk and Prolific for online survey recruitment.
Citations: 1
Sample Size: 699
-
Authors: DT Esch, N Mylonopoulos, V Theoharakis
Year: 2025
Published in: Behavior Research Methods, 2025 - Springer
Institution: University of Cologne, University of Piraeus, Aristotle University of Thessaloniki
Research Area: Crowdsourcing Behavioral Research, Mobile Data Collection
Discipline: Behavioral Research
Mobile-based responses via platforms like Pollfish are comparable in quality to computer-based ones from MTurk and Prolific, though attentiveness varies significantly across platforms and is influenced by incentives, distractions, and System 1 thinking.
Methods: Conducted two studies distributing the same survey across MTurk, Prolific, Pollfish, and Qualtrics panels to compare data quality and analyze attentiveness scores.
Key Findings: Attentiveness, device usage (mobile vs. computer), and factors influencing data quality such as incentives, respondent activity, distractions, and survey familiarity.
Citations: 1
-
Authors: P Schmidtová, O Dušek, S Mahamood
Year: 2025
Published in: ArXiv
Institution: Charles University, Trivago
Research Area: Summarization evaluation, Natural Language Processing, LLM-as-a-Judge, AI Evaluation
Discipline: Natural Language Processing
Simpler metrics like word overlap surprisingly correlate well with human judgments in summarization evaluation, outperforming complex methods in out-of-domain applications, though LLMs remain unreliable for assessment due to annotation biases.
Methods: Human evaluation campaigns with categorical error assessment, span-level annotations, and comparison of traditional metrics, trainable models, and LLM-as-a-judge approaches.
Key Findings: Effectiveness of summarization evaluation methods and their correlation with human judgment, along with business impacts of incorrect information in generated summaries.
Citations: 1
-
Authors: PW Mirowski, J Love, K Mathewson, S Mohamed
Year: 2024
Published in: ArXiv
Institution: Google DeepMind, Google
Research Area: AI Creativity, Humor Generation, Human-Computer Interaction (HCI)
Discipline: Artificial Intelligence
Professional comedians found LLMs insufficient as creativity support tools for comedy, citing bias, bland output, and reinforcement of hegemonic viewpoints.
Methods: Workshops conducted with professional comedians combining comedy writing sessions using LLMs, a Creativity Support Index questionnaire, and focus groups discussing their experiences and ethical concerns.
Key Findings: Effectiveness of LLMs as creativity support tools for comedy writing, ethical concerns (bias, censorship, copyright), and value alignment in AI outputs.
Citations: 52
Sample Size: 20
-
Authors: AYJ Ha, J Passananti, R Bhaskar, S Shan
Year: 2024
Published in: Proceedings of the ..., 2024
Institution: University of California Santa Barbara, The University of Chicago, Institute of Education, University College London
Research Area: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
Discipline: Human-Computer Interaction (HCI), Generative AI, Digital Forensics
The paper compares human and automated detectors at distinguishing human art from AI-generated images, finding that combining the two offers the best performance despite persistent weaknesses.
Methods: Comparison of human art across 7 styles with AI-generated images from 5 generative models, assessed using 5 automated detectors and 3 human groups (crowdworkers, professional artists, expert artists).
Key Findings: Detection accuracy and robustness of human and automated methods in identifying AI-generated images under benign and adversarial conditions.
DOI: https://doi.org/10.1145/3658644.3670306
Citations: 52
Sample Size: 3,993
-
Authors: T Eloundou, A Beutel, DG Robinson
Year: 2024
Published in: arXiv preprint arXiv:2410.19803, 2024
Institution: OpenAI, Google DeepMind, Google, University of Oxford
Research Area: Fairness in LLM, AI Bias, AI Ethics
Discipline: Artificial Intelligence, Social Science
The paper introduces a counterfactual approach to evaluate 'first-person fairness' in chatbots, demonstrating that reinforcement learning can mitigate biases based on demographics across extensive chatbot interactions.
Methods: The study uses a Language Model as a Research Assistant (LMRA) to quantitatively and qualitatively assess biases based on demographics across millions of chatbot interactions, covering 66 tasks in 9 domains and involving two genders and four races. Bias evaluations are corroborated by independent...
Key Findings: Demographic biases in chatbot responses, including harmful stereotypes and response differences by gender and race, across diverse tasks and domains.
DOI: https://doi.org/10.48550/arXiv.2410.19803
Citations: 33
Sample Size: 6,000,000
-
Authors: K Warren, T Tucker, A Crowder, D Olszewski
Year: 2024
Published in: Proceedings of the ..., 2024
Institution: University of Florida
Research Area: Audio Deepfake Detection, Human Factors in AI Security, Perceptual Studies, AI Security
Discipline: Computer Science
Humans outperform machine learning models in classifying real human audio versus deepfakes, but are often misled by preconceptions about generated content, highlighting the need for more synergistic approaches between human and machine decision-making.
Methods: A large-scale user study was conducted where over 1,200 participants evaluated audio samples from three widely-cited deepfake datasets. Performance was quantitatively measured and thematic analysis was used to explore user reasoning and differences from machine classification.
Key Findings: Comparison of human and machine classification performance on audio deepfake detection, analysis of user reasoning, and evaluation of error patterns between both humans and models.
DOI: https://doi.org/10.1145/3658644.3670325
Citations: 14
Sample Size: 1,200