Discover 75 peer-reviewed studies in LLM research (2025–2026). Explore research findings powered by Prolific's diverse participant panel.
This page lists 75 peer-reviewed papers in the research area of LLM in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23404
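For readers unfamiliar with the underlying model, below is a minimal sketch of a Bradley-Terry-Davidson fit on synthetic pairwise comparisons. The paper's hierarchical Bayesian version adds demographic-level priors and post-stratification to census data, which this sketch omits; the strengths and tie parameter are illustrative assumptions, not the paper's estimates.

```python
# Minimal Bradley-Terry-Davidson fit on synthetic pairwise-comparison data.
# The Davidson extension adds a tie parameter so "no preference" judgments
# are modeled rather than discarded. All values here are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_models = 4
true_strength = np.array([1.5, 1.0, 0.7, 0.4])

# Synthetic comparisons: outcome 0 = i wins, 1 = j wins, 2 = tie.
comparisons = []
for _ in range(2000):
    i, j = rng.choice(n_models, size=2, replace=False)
    pi, pj = true_strength[i], true_strength[j]
    tie = 0.3 * np.sqrt(pi * pj)              # Davidson tie propensity
    probs = np.array([pi, pj, tie]) / (pi + pj + tie)
    comparisons.append((i, j, rng.choice(3, p=probs)))

def neg_log_lik(params):
    # params: per-model log-strengths plus a log tie parameter (Davidson's nu).
    p, nu = np.exp(params[:n_models]), np.exp(params[n_models])
    ll = 0.0
    for i, j, out in comparisons:
        denom = p[i] + p[j] + nu * np.sqrt(p[i] * p[j])
        outcome_probs = (p[i] / denom, p[j] / denom,
                         nu * np.sqrt(p[i] * p[j]) / denom)
        ll += np.log(outcome_probs[out])
    return -ll

res = minimize(neg_log_lik, x0=np.zeros(n_models + 1), method="L-BFGS-B")
strengths = np.exp(res.x[:n_models])
print("estimated strengths (relative to model 0):", strengths / strengths[0])
```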
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing ...
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: Investigated RLHF's fundamentals, focusing on the role of reward models, implications of design choices in RLHF training algorithms, and underlying issues like generalization errors, model misspecification, and feedback sparsity.
Citations: 117
-
Authors: LM Schulze Buschoff, E Akata, M Bethge
Year: 2025
Published in: Nature Machine ...
Institution: Max Planck Institute
Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)
Discipline: Cognitive Science, Artificial Intelligence, Computer Vision
Vision-based large language models show proficiency in interpreting visual data but fall short of human-like abilities in causal reasoning, intuitive physics, and social cognition.
Methods: Controlled experiments evaluating model performance on tasks related to intuitive physics, causal reasoning, and intuitive psychology using visual processing benchmarks.
Key Findings: Model capabilities in understanding physical interactions, causal relationships, and social preferences.
DOI: https://doi.org/10.1038/s42256-024-00963-y
Citations: 70
-
Authors: F Salvi, M Horta Ribeiro, R Gallotti, R West
Year: 2025
Published in: Nature Human Behaviour
Institution: EPFL, Fondazione Bruno Kessler, Princeton University
Research Area: Conversational Persuasion with LLMs, Human-Computer Interaction (HCI), Behavioral Science, LLM
Discipline: Behavioral Science
GPT-4 can use personalized arguments to be more persuasive in debates, outperforming humans in 64.4% of AI-human comparisons when personalization is applied.
Methods: Preregistered controlled study involving multiround debates with random assignment to conditions focusing on AI-human comparisons, personalization, and opinion strength.
Key Findings: Effectiveness of persuasion by GPT-4, especially when using personalized arguments, compared to humans in debates.
Citations: 65
Sample Size: 900
-
Authors: SSY Kim, JW Vaughan, QV Liao, T Lombrozo
Year: 2025
Published in: Proceedings of the ...
Institution: Wake Forest University, University of Illinois at Urbana-Champaign, Princeton University, University of California Berkeley
Research Area: Appropriate Reliance on LLMs, Explainable AI, Human-AI Interaction, Cognitive Psychology
Discipline: Cognitive Psychology, Artificial Intelligence, Human-Computer Interaction (HCI)
The study examines factors that influence users' reliance on LLM responses, finding that explanations increase reliance, while sources and inconsistencies within explanations reduce reliance on incorrect responses.
Methods: Think-aloud study followed by a pre-registered, controlled experiment to assess the impact of explanations, sources, and inconsistencies in LLM responses on user reliance.
Key Findings: Users' reliance on LLM responses, accuracy, and the influence of explanations, inconsistencies, and sources on these measures.
DOI: https://doi.org/10.1145/3706598.3714020
Citations: 38
Sample Size: 308
-
Authors: H Bai, JG Voelkel, S Muldowney, JC Eichstaedt
Year: 2025
Published in: Nature ...
Institution: Stanford University
Research Area: Political Persuasion, LLM
Discipline: Computational Social Science
LLM-generated messages can effectively persuade humans on policy issues similarly to human-crafted messages, with differences in perceived persuasion mechanisms.
Methods: Three pre-registered experiments were conducted comparing the persuasive effectiveness of LLM-generated and human-generated messages on policy attitudes, using control conditions with neutral messages.
Key Findings: Influence of LLM-generated messages on participants' policy attitudes and perceived characteristics of the message authors.
Citations: 37
Sample Size: 4829
-
Authors: K Hackenburg, L Ibrahim, BM Tappin, M Tsakiris
Year: 2025
Published in: AI & SOCIETY
Institution: Oxford Internet Institute, University of Oxford
Research Area: Political Communication and Persuasion, LLM
Discipline: Political Science, Artificial Intelligence
GPT-4's ability to generate persuasive messages rivaled human experts on polarized US political issues, suggesting AI tools may have significant implications for political campaigns and democracy.
Methods: Pre-registered experiment where GPT-4 generated partisan role-playing persuasive messages, which were compared to those from human persuasion experts.
Key Findings: Persuasive impact of GPT-4-generated messages versus human expert messages on U.S. political issues.
Citations: 35
Sample Size: 4955
-
Authors: T Zhang, A Koutsoumpis, JK Oostrom
Year: 2025
Published in: IEEE Transactions ...
Institution: Southeast University, Vrije Universiteit, Tilburg University
Research Area: LLM Personality Assessment, Human-AI Interaction, LLM
Discipline: Human-AI Interaction, Social Science, Humanities
LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.
Methods: The study evaluated GPT-3.5 and GPT-4 performance in assessing personality traits and interview performance using simulated AVI responses, comparing them with ratings from task-specific AI and human annotators.
Key Findings: Validity, reliability, fairness, and rating patterns of LLMs (GPT-3.5 and GPT-4) in personality assessment from asynchronous video interviews.
Citations: 31
Sample Size: 685
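A quick way to convey what "validity" means in this setting is a convergent-validity check: correlating LLM trait scores with mean human-annotator scores for the same interviewees. A minimal sketch with synthetic ratings (not the study's data or rating scale):

```python
# Convergent validity sketch: correlate LLM-assigned trait scores with
# mean human-annotator scores. All values are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
n = 100
human_mean = rng.normal(3.5, 0.6, size=n)            # 1-5 trait ratings
llm_score = human_mean + rng.normal(0, 0.5, size=n)  # noisy LLM judgment

r, p = pearsonr(human_mean, llm_score)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```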
-
Authors: K Hackenburg, BM Tappin, P Röttger, SA Hale
Year: 2025
Published in: Proceedings of the ...
Institution: University of California Berkeley, University of Cambridge, University of Oxford, Max Planck Institute
Research Area: Political Persuasion, LLM
Discipline: Computational Social Science, Political Science
Scaling language model sizes leads to diminishing returns in generating persuasive political messages, with larger models providing minimal gains compared to smaller ones after controlling for task completion metrics like coherence and relevance.
Methods: Generated 720 political messages using 24 LLMs of varying sizes and tested their persuasiveness through a large-scale randomized survey experiment.
Key Findings: Persuasive capability of language models across different sizes in generating political messages.
Citations: 31
Sample Size: 25982
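Diminishing returns of this kind are typically diagnosed by regressing the estimated persuasion effect on log model size; a minimal sketch with made-up numbers (not the paper's estimates):

```python
# Seeing diminishing returns: regress persuasion effect against log
# parameter count. All numbers are synthetic, not the paper's estimates.
import numpy as np

params = np.array([1e8, 5e8, 1e9, 7e9, 7e10, 3e11])  # model sizes
effect = np.array([2.0, 3.0, 3.3, 3.8, 4.0, 4.05])   # pp attitude shift

slope, intercept = np.polyfit(np.log10(params), effect, 1)
print(f"effect = {intercept:.2f} + {slope:.2f} * log10(parameters)")
# A roughly log-linear effect means each 10x increase in size buys only a
# constant small increment, i.e., sharply diminishing returns on the raw
# parameter scale.
```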
-
Authors: JY Bo, S Wan, A Anderson
Year: 2025
Published in: Proceedings of the 2025 CHI Conference ...
Institution: University of Toronto
Research Area: Appropriate Reliance on LLMs, Human-Computer Interaction (HCI), AI-Assisted Decision Making
Discipline: Human-Computer Interaction (HCI)
The paper investigates how users calibrate their reliance on LLM outputs in AI-assisted decision-making, evaluating interventions intended to support appropriate reliance.
Citations: 25
-
Authors: F Sun, N Li, K Wang, L Goette
Year: 2025
Published in: arXiv preprint arXiv:2505.02151
Institution: HKU Business School
Research Area: LLM Overconfidence and Human Bias Amplification, Bias, LLM
Discipline: Artificial Intelligence, Behavioral Science
Large language models (LLMs) exhibit overconfidence that is especially severe where their stated certainty declines, and their input doubles overconfidence in human decision making despite improving accuracy, thereby amplifying human bias.
Methods: Algorithmically constructed reasoning problems with known ground truths were used to evaluate LLMs' confidence; comparisons were drawn with human performance using similar experimental protocols.
Key Findings: LLM confidence levels, correctness probabilities, comparison of bias between LLMs and humans, and effects of LLM input on human decision making.
Citations: 21
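The overconfidence the authors measure reduces to the gap between average stated confidence and realized accuracy; a minimal sketch with illustrative numbers:

```python
# Overconfidence as the gap between stated confidence and realized accuracy.
# Numbers below are illustrative, not the paper's data.
import numpy as np

confidence = np.array([0.95, 0.90, 0.85, 0.99, 0.80, 0.92])  # stated P(correct)
correct    = np.array([1,    0,    1,    1,    0,    0   ])  # 1 = answer right

accuracy = correct.mean()
mean_conf = confidence.mean()
print(f"accuracy={accuracy:.3f}, mean confidence={mean_conf:.3f}, "
      f"overconfidence gap={mean_conf - accuracy:+.3f}")
```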
-
Authors: T Mendel, N Singh, DM Mann, B Wiesenfeld
Year: 2025
Published in: Journal of Medical ...
Institution: The City University of New York, George Washington University, New York University
Research Area: LLMs in Digital Health, Health Queries, User Attitudes
Discipline: Digital Health
Laypeople primarily use search engines rather than large language models (LLMs) for health queries, perceiving LLMs as less useful but also less biased and more human-like, with no significant difference in trust or ease of use.
Methods: A screening survey followed by logistic regression analysis and a follow-up survey; comparisons were performed using ANOVA, Tukey post hoc tests, and paired-sample Wilcoxon tests.
Key Findings: Demographics and behaviors of LLM and search engine users for health queries, perceived usefulness, ease of use, trustworthiness, bias, and anthropomorphism.
Citations: 21
Sample Size: 2002
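Of the tests listed, the paired-sample Wilcoxon is the simplest to reproduce; a minimal sketch with synthetic ratings (the scale and values are assumptions, not the study's data):

```python
# Paired-sample Wilcoxon signed-rank test comparing the same respondents'
# ratings of two tools. Synthetic ratings, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
search_usefulness = rng.integers(4, 8, size=200)  # ratings 4..7
llm_usefulness    = rng.integers(3, 7, size=200)  # ratings 3..6

stat, p = wilcoxon(search_usefulness, llm_usefulness)
print(f"Wilcoxon statistic={stat}, p={p:.4g}")
```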
-
Authors: K Hackenburg, BM Tappin, L Hewitt, E Saunders
Year: 2025
Published in: Science
Institution: London School of Economics and Political Science, Stony Brook University
Research Area: Political Persuasion with Conversational AI, LLM, Factual Accuracy in AI Systems.
Discipline: Political Science, Computational Social Science
This Science paper shows that conversational AI chatbots can systematically influence political opinions at scale, and that techniques such as post-training and prompting make them far more persuasive, but that this increased persuasion is tied to reduced factual accuracy in what the AI says.
Citations: 12
-
Authors: Z Chen, J Kalla, Q Le, S Nakamura-Sakai
Year: 2025
Published in: arXiv preprint arXiv ...
Institution: Not available
Research Area: Artificial Intelligence and Social Science, Persuasion Studies, Political Persuasion, LLM Chatbots, Democratic Societies
Discipline: Artificial Intelligence, Social Science
The study evaluates the cost-effectiveness and persuasive risks of large language model (LLM) chatbots in political contexts, finding that while LLM chatbots are as persuasive as campaign ads conditional on exposure, their large-scale influence is currently limited by scalability and cost barriers.
Methods: Two survey experiments combined with real-world simulation exercises to measure the persuasiveness of LLM chatbots compared to traditional campaign tactics, focusing on both exposure and acceptance phases of persuasion.
Key Findings: Short- and long-term persuasive effects of LLMs, cost-effectiveness of LLM-based persuasion ($48-$74 per persuaded voter), and scalability compared to traditional campaign approaches.
Citations: 7
Sample Size: 10417
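The cost-per-persuaded-voter figure is a simple ratio of spend to the number persuaded; a worked sketch with illustrative numbers (not the paper's inputs):

```python
# Cost per persuaded voter = campaign cost / number persuaded.
# All inputs below are illustrative, not the paper's figures.
cost_total = 50_000.0   # dollars spent on chatbot conversations
reached = 20_000        # voters who completed a conversation
persuasion_rate = 0.04  # share persuaded, net of control group

persuaded = reached * persuasion_rate
print(f"${cost_total / persuaded:.0f} per persuaded voter")
```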
-
Authors: TS Behrend, RN Landers
Year: 2025
Published in: Journal of Business and Psychology
Institution: University of Nebraska-Lincoln, University of Minnesota
Research Area: LLM in Behavioral Science Research, AI-Assisted Research Methodology
Discipline: Behavioral Science, Psychology, Artificial Intelligence
The paper proposes a framework with five use cases for integrating large language models into survey and experimental research, introduces the Qualtrics-AI Link (QUAIL) tool, and highlights technical and ethical considerations for using LLMs effectively and validly.
Methods: The paper outlines a decision-making framework for five potential uses of LLMs in survey and experimental design, introduces software (QUAIL) for integrating LLM knowledge into Qualtrics, and details technical steps such as prompt engineering, model testing, and validity monitoring.
Key Findings: Applications, implementation strategies, and ethical considerations of large language models in psychological research material development.
DOI: https://doi.org/10.1007/s10869-025-10035-6
Citations: 6
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about biases, hegemonic language, and human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: Y Ba, MV Mancenido, EK Chiou, R Pan
Year: 2025
Published in: Behavior Research Methods
Institution: University of Delaware, National Taiwan University, University of British Columbia, Monash University
Research Area: Crowdsourcing, Data Quality, Spamming Behavior Detection, LLM Applications in Behavioral Research
Discipline: Computer Science, Artificial Intelligence, LLM
The paper introduces a systematic method to evaluate crowdsourced data quality and detect spam behaviors through variance decomposition, proposing a spammer index and credibility metrics to improve consistency and reliability in labeling tasks.
Methods: Variance decomposition, Markov chain models, and generalized random effects models were used to assess annotator consistency and credibility; metrics were applied to both simulated and real-world data from two crowdsourcing platforms.
Key Findings: Quality of crowdsourced data, spammer behaviors, annotators’ consistency, and credibility.
Citations: 2
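The intuition behind a spammer index can be shown with a much simpler statistic, agreement with the majority label; the paper's variance-decomposition approach is more principled. A sketch on synthetic annotations:

```python
# Intuition behind a spammer index: a careless annotator's labels carry
# little information about the item being labeled. Here each annotator is
# scored by agreement with the majority label. Data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n_items, n_annotators = 100, 8
true_labels = rng.integers(0, 2, size=n_items)

labels = np.empty((n_annotators, n_items), dtype=int)
for a in range(n_annotators):
    if a < 6:   # diligent: ~90% agreement with ground truth
        flip = rng.random(n_items) < 0.1
    else:       # spammer: answers at random
        flip = rng.random(n_items) < 0.5
    labels[a] = np.where(flip, 1 - true_labels, true_labels)

majority = (labels.mean(axis=0) > 0.5).astype(int)
agreement = (labels == majority).mean(axis=1)
for a, score in enumerate(agreement):
    print(f"annotator {a}: agreement with majority = {score:.2f}")
```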
-
Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard
Year: 2025
Published in: arXiv
Institution: Aleph Alpha, Massachusetts Institute of Technology
Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming
Discipline: Natural Language Processing
The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.
Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.
Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.
Citations: 2
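A linear probe here is just a logistic regression trained on frozen model representations; below is a minimal sketch in which random vectors stand in for hidden states (extracting real representations would require a specific model and layer):

```python
# A linear probe: logistic regression on frozen model representations.
# Random vectors stand in for hidden states; in practice you would extract
# them from a chosen layer of the language model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, dim = 500, 256
ambiguous = rng.integers(0, 2, size=n)   # 1 = sentence is ambiguous
direction = rng.normal(size=dim)         # pretend signal direction
reps = rng.normal(size=(n, dim)) + np.outer(ambiguous, direction)

X_tr, X_te, y_tr, y_te = train_test_split(reps, ambiguous, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```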
-
Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares
Year: 2025
Published in: arXiv preprint arXiv …
Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University
Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, RLHF, LLM
Discipline: Computer Science, Artificial Intelligence, Machine Learning
The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.
Methods: The authors proposed WorldTest, a protocol separating reward-free interaction from scored tests in related environments, with evaluations done using AutumnBench—a dataset of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics.
Key Findings: Performance of model-learning agents and humans in acquiring world models for masked-frame prediction, planning, and understanding causal dynamics.
Citations: 1
Sample Size: 517
-
Authors: S Liu, Z Cai, H Wang, Z Ma, X Li
Year: 2025
Published in: arXiv preprint arXiv:2505.19134
Institution: Meta, Imperial College London
Research Area: Artificial Intelligence, Crowdsourcing, LLM
Discipline: Artificial Intelligence
The paper develops a principal-agent model to incentivize high-quality human annotations using golden questions and identifies criteria for these questions to effectively monitor annotators' performance.
Methods: The authors use a principal-agent model with maximum likelihood estimation (MLE) and hypothesis testing to design incentive-compatible systems for annotators; golden questions with high-certainty answers and a format similar to normal items were selected and validated through experiments.
Key Findings: The effectiveness of golden questions for incentivizing and monitoring high-quality human annotations in preference data.
DOI: https://doi.org/10.48550/arXiv.2505.19134
Citations: 1
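The monitoring half of such a scheme reduces to a hypothesis test on an annotator's accuracy over the embedded golden questions; a minimal sketch using a one-sided binomial test (the diligence threshold and significance level are illustrative, not the paper's):

```python
# Monitoring annotators with golden questions: test whether an annotator's
# accuracy on embedded gold items is consistent with diligent work.
# The 0.9 diligent-accuracy assumption and alpha = 0.05 are illustrative.
from scipy.stats import binomtest

n_golden, n_correct = 20, 14
# H0: annotator is diligent (accuracy >= 0.9); reject if too few correct.
result = binomtest(n_correct, n_golden, p=0.9, alternative="less")
print(f"p-value = {result.pvalue:.4f} ->",
      "flag annotator" if result.pvalue < 0.05 else "no evidence of spamming")
```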