Llm Evaluation: Research Area — Prolific Citations Library

Discover 13 peer-reviewed studies in Llm Evaluation (2023–2026). Explore research findings powered by Prolific's diverse participant panel.

This page lists 13 peer-reviewed papers in the research area of Llm Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.

Papers (13 of 13)

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Authors: N Petrova, A Gordon, E Blindow

Year: 2026

Published in: Open review

Institution: Prolific

Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation

Discipline: Machine Learning, Artificial Intelligence

The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.

Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.

Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.

Sample Size: 23404
Trick or Neat: Adversarial Ambiguity and Language Model Evaluation

Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard

Year: 2025

Published in: ArXiv

Institution: Aleph Alpha, Massachusetts Institute of Technology

Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming

Discipline: Natural Language Processing

The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.

Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.

Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.

Citations: 2
Real-World Summarization: When Evaluation Reaches Its Limits

Authors: P Schmidtová, O Dušek, S Mahamood

Year: 2025

Published in: ArXiv

Institution: Charles University, Trivago

Research Area: Summarization evaluation, Natural Language Processing, LLM-as-a-Judge, AI Evaluation

Discipline: Natural Language Processing

Simpler metrics like word overlap surprisingly correlate well with human judgments in summarization evaluation, outperforming complex methods in out-of-domain applications, though LLMs remain unreliable for assessment due to annotation biases.

Methods: Human evaluation campaigns with categorical error assessment, span-level annotations, and comparison of traditional metrics, trainable models, and LLM-as-a-judge approaches.

Key Findings: Effectiveness of summarization evaluation methods and their correlation with human judgment, along with business impacts of incorrect information in generated summaries.

Citations: 1
10 Questions to Fall in Love with ChatGPT: An Experimental Study on Interpersonal Closeness with Large Language Models (LLMs)

Authors: J Szczuka, L Mühl, P Ebner, S Dubé

Year: 2025

Published in: ArXiv

Institution: University of Duisburg-Essen

Research Area: Human-Computer Interaction (HCI), Social Psychology, Interpersonal Relationships with AI, LLM Evaluation

Discipline: Social Science

Participants rated AI-generated dating profile responses equally as human-like in terms of closeness and romantic interest, challenging assumptions about authenticity in online communication.

Methods: Participants evaluated 10 AI-generated responses to an interpersonal closeness task in a matchmaking scenario, without knowing the responses were AI-generated.

Key Findings: Impact of perceived response source (human vs AI) on interpersonal closeness and romantic interest; influence of perceived quality and human-likeness.

Sample Size: 307
Multimodal large language models can make context-sensitive hate speech evaluations aligned with human judgement

Authors: T Davidson

Year: 2025

Published in: Nature Human Behaviour, 2025 - nature.com

Institution: University of Oxford, Davidson College

Research Area: Hate Speech Evaluation, Multimodal LLMs, Social Bias, Computational Law, AI Bias, AI Evaluation

Discipline: Artificial Intelligence

The study demonstrates that larger multimodal large language models (MLLMs) can align closely with human judgement in context-sensitive hate speech evaluations, though they still exhibit biases and limitations.

Methods: Conjoint experiments where simulated social media posts varying in attributes like slur usage and user demographics were evaluated by MLLMs and compared to human judgements.

Key Findings: The capacity of MLLMs to evaluate hate speech in a context-sensitive manner and their alignment with human judgement, while assessing biases and responsiveness to contextual cues.

Sample Size: 1854
Can Large Language Models Understand Symbolic Graphics Programs?

Authors: Z Qiu, W Liu, H Feng, Z Liu, T Xiao

Year: 2024

Published in: ArXiv

Institution: Massachusetts Institute of Technology, Max Planck Institute, University of Cambridge

Research Area: Computational cognition, LLM evaluation, Program synthesis, Multimodal reasoning

Discipline: Artificial Intelligence

Introduces SGP-Bench, a benchmark testing whether LLMs can answer semantic and spatial questions about images purely from graphics programs (SVG/CAD), effectively probing “visual imagination without vision.” The authors show current LLMs struggle - sometimes near chance - even when images are trivial for humans, but demonstrate that Symbolic Instruction Tuning (SIT) can meaningfully improve thi...
Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

Authors: Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Year: 2024

Published in: Preprint

Institution: Chinese University of Hong Kong, Tianjin Medical University

Research Area: LLM Emotional Evaluation, Affective Computing, Artificial Intelligence in Psychology

Discipline: Artificial Intelligence
Evaluating Creative Short Story Generation in Humans and Large Language Models

Authors: Mete Ismayilzada1,2, Claire Stevenson3, Lonneke van der Plas

Year: 2024

Published in: ArXiv

Institution: Idiap Research Institute, University of Amsterdam, Università della Svizzera Italiana, École Polytechnique Fédérale de Lausanne

Research Area: Creative Story Generation, LLM Evaluation, Computational Creativity

Discipline: Artificial Intelligence, Natural Language Processing, Computational Creativity
Larger and more instructable language models become less reliable

Authors: Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri & José Hernández-Orallo

Year: 2024

Published in: Nature

Institution: Universitat Politècnica de València, University of Cambridge, ValGRAI

Research Area: LLM reliability and evaluation, competency assessment

Discipline: Artificial Intelligence, Behavioral Science
People cannot distinguish GPT-4 from a human in a Turing test

Authors: C Jones, B Bergen

Year: 2024

Published in: ArXiv

Institution: University of California San Diego

Research Area: Turing Test, LLM Evaluation, Cognitive Science of AI

Discipline: Artificial Intelligence, Cognitive Science, Human-Computer Interaction (HCI)
Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models

Authors: Yi-Cheng Lin, Wei-Chih Chen, Hung-yi Lee

Year: 2024

Published in: ArXiv

Institution: National Taiwan University

Research Area: Speech LLM, Social Bias, Evaluation

Discipline: Artificial Intelligence
Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Authors: Martha Lewis, Melanie Mitchell

Year: 2024

Published in: ArXiv

Institution: Santa Fe Institute, University of Bristol

Research Area: LLM Analogical Reasoning, Counterfactual Evaluation, Generality of AI Reasoning

Discipline: Artificial Intelligence
Human feedback is not gold standard

Authors: T Hosking, P Blunsom, M Bartolo

Year: 2023

Published in: arXiv preprint arXiv:2309.16349, 2023 - arxiv.org

Institution: Cohere, University of Edinburgh, University College London

Research Area: LLM Evaluation, Limitations of Human Preference Scores, Human-Computer Interaction (HCI) in AI Training

Discipline: Artificial Intelligence

DOI: https://doi.org/10.48550/arXiv.2309.16349

Citations: 72