Discover 12 peer-reviewed studies in LLM Evaluation (2023–2026). Explore research findings powered by Prolific's diverse participant panel.
This page lists 12 peer-reviewed papers in the research area of LLM Evaluation in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23404
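As a rough illustration of the Bradley-Terry-Davidson component named in the Methods line, the sketch below computes win, loss, and tie probabilities for a single pairwise comparison. It is a minimal sketch, not the authors' implementation: it leaves out the hierarchical Bayesian estimation and the post-stratification step, and the function name, parameter names, and example values are all hypothetical.

```python
import math


def btd_probabilities(strength_i, strength_j, tie_param):
    """Davidson's tie extension of the Bradley-Terry model.

    strength_i, strength_j: positive latent strengths (hypothetical values).
    tie_param: non-negative tie propensity; 0 recovers plain Bradley-Terry.
    Returns (P(i preferred), P(j preferred), P(tie)).
    """
    # The tie mass scales with the geometric mean of the two strengths.
    tie_mass = tie_param * math.sqrt(strength_i * strength_j)
    total = strength_i + strength_j + tie_mass
    return strength_i / total, strength_j / total, tie_mass / total


# Illustrative values only, not taken from the study.
p_win, p_loss, p_tie = btd_probabilities(2.0, 1.0, 0.5)
print(p_win, p_loss, p_tie)
```

In a full analysis the strengths and the tie parameter would be estimated from the comparison data rather than supplied by hand.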
-
Authors: P Schmidtová, O Dušek, S Mahamood
Year: 2025
Published in: ArXiv
Institution: Charles University, Trivago
Research Area: Summarization evaluation, Natural Language Processing, LLM-as-a-Judge, AI Evaluation
Discipline: Natural Language Processing
Simpler metrics like word overlap surprisingly correlate well with human judgments in summarization evaluation, outperforming complex methods in out-of-domain applications, though LLMs remain unreliable for assessment due to annotation biases.
Methods: Human evaluation campaigns with categorical error assessment, span-level annotations, and comparison of traditional metrics, trainable models, and LLM-as-a-judge approaches.
Key Findings: Effectiveness of summarization evaluation methods and their correlation with human judgment, along with business impacts of incorrect information in generated summaries.
Citations: 1
-
Authors: J Szczuka, L Mühl, P Ebner, S Dubé
Year: 2025
Published in: ArXiv
Institution: University of Duisburg-Essen
Research Area: Human-Computer Interaction, Social Psychology, Interpersonal Relationships with AI, LLM Evaluation
Discipline: Social Science
Participants rated AI-generated dating profile responses equally as human-like in terms of closeness and romantic interest, challenging assumptions about authenticity in online communication.
Methods: Participants evaluated 10 AI-generated responses to an interpersonal closeness task in a matchmaking scenario, without knowing the responses were AI-generated.
Key Findings: Impact of perceived response source (human vs AI) on interpersonal closeness and romantic interest; influence of perceived quality and human-likeness.
Sample Size: 307
-
Authors: T Davidson
Year: 2025
Published in: Nature Human Behaviour
Institution: University of Oxford, Davidson College
Research Area: Hate Speech Evaluation, Multimodal LLMs, Social Bias, Computational Law, AI Bias, AI Evaluation
Discipline: Artificial Intelligence
The study demonstrates that larger multimodal large language models (MLLMs) can align closely with human judgement in context-sensitive hate speech evaluations, though they still exhibit biases and limitations.
Methods: Conjoint experiments where simulated social media posts varying in attributes like slur usage and user demographics were evaluated by MLLMs and compared to human judgements.
Key Findings: The capacity of MLLMs to evaluate hate speech in a context-sensitive manner and their alignment with human judgement, while assessing biases and responsiveness to contextual cues.
Sample Size: 1854
-
Authors: Z Qiu, W Liu, H Feng, Z Liu, T Xiao
Year: 2024
Published in: ArXiv
Institution: Massachusetts Institute of Technology, Max Planck Institute, University of Cambridge
Research Area: Computational cognition, LLM evaluation, Program synthesis, Multimodal reasoning
Discipline: Artificial Intelligence
Introduces SGP-Bench, a benchmark testing whether LLMs can answer semantic and spatial questions about images purely from graphics programs (SVG/CAD), effectively probing “visual imagination without vision.” The authors show current LLMs struggle - sometimes near chance - even when images are trivial for humans, but demonstrate that Symbolic Instruction Tuning (SIT) can meaningfully improve thi...
-
Authors: Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu
Year: 2024
Published in: Preprint
Institution: Chinese University of Hong Kong, Tianjin Medical University
Research Area: LLM Emotional Evaluation, Affective Computing, Artificial Intelligence in Psychology
Discipline: Artificial Intelligence
-
Authors: Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas
Year: 2024
Published in: ArXiv
Institution: Idiap Research Institute, University of Amsterdam, Università della Svizzera Italiana, École Polytechnique Fédérale de Lausanne
Research Area: Creative Story Generation, LLM Evaluation, Computational Creativity
Discipline: Artificial Intelligence, Natural Language Processing, Computational Creativity
-
Authors: Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, José Hernández-Orallo
Year: 2024
Published in: Nature
Institution: Universitat Politècnica de València, University of Cambridge, ValGRAI
Research Area: LLM reliability and evaluation, competency assessment
Discipline: Artificial Intelligence, Behavioral Science
-
Authors: C Jones, B Bergen
Year: 2024
Published in: ArXiv
Institution: University of California San Diego
Research Area: Turing Test, LLM Evaluation, Cognitive Science of AI
Discipline: Artificial Intelligence, Cognitive Science, Human-Computer Interaction
-
Authors: Yi-Cheng Lin, Wei-Chih Chen, Hung-yi Lee
Year: 2024
Published in: ArXiv
Institution: National Taiwan University
Research Area: Speech LLM, Social Bias, Evaluation
Discipline: Artificial Intelligence
-
Authors: Martha Lewis, Melanie Mitchell
Year: 2024
Published in: ArXiv
Institution: Santa Fe Institute, University of Bristol
Research Area: LLM Analogical Reasoning, Counterfactual Evaluation, Generality of AI Reasoning
Discipline: Artificial Intelligence
-
Authors: T Hosking, P Blunsom, M Bartolo
Year: 2023
Published in: arXiv preprint arXiv:2309.16349
Institution: Cohere, University of Edinburgh, University College London
Research Area: LLM Evaluation, Limitations of Human Preference Scores, Human-Computer Interaction (HCI) in AI Training
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2309.16349
Citations: 72