Discover 75 peer-reviewed studies in LLM research (2025–2026). Explore research findings powered by Prolific's diverse participant panel.
This page lists 75 peer-reviewed papers in the research area of LLM in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: N Petrova, A Gordon, E Blindow
Year: 2026
Published in: Open review
Institution: Prolific
Research Area: Human-centered AI evaluation, Bayesian statistics, Responsible AI, AI alignment, LLM Evaluation
Discipline: Machine Learning, Artificial Intelligence
The study introduces HUMAINE, a multidimensional evaluation framework for LLMs, revealing demographic-specific preference variations and ranking google/gemini-2.5-pro as the top-performing model with a posterior probability of 95.6%.
Methods: Multi-turn naturalistic conversations analyzed using a hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data, stratified across 22 demographic groups.
Key Findings: Performance of 28 LLMs across five human-centric dimensions, accounting for demographic-specific preferences.
Sample Size: 23404
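For readers unfamiliar with the underlying model, below is a minimal sketch of a Bradley-Terry-Davidson fit on synthetic pairwise comparisons. The paper's hierarchical Bayesian version adds demographic-level priors and post-stratification to census data, which this sketch omits; the strengths and tie parameter are illustrative assumptions, not the paper's estimates.

```python
# Minimal Bradley-Terry-Davidson fit on synthetic pairwise-comparison data.
# The Davidson extension adds a tie parameter so "no preference" judgments
# are modeled rather than discarded. All values here are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_models = 4
true_strength = np.array([1.5, 1.0, 0.7, 0.4])

# Synthetic comparisons: outcome 0 = i wins, 1 = j wins, 2 = tie.
comparisons = []
for _ in range(2000):
    i, j = rng.choice(n_models, size=2, replace=False)
    pi, pj = true_strength[i], true_strength[j]
    tie = 0.3 * np.sqrt(pi * pj)              # Davidson tie propensity
    probs = np.array([pi, pj, tie]) / (pi + pj + tie)
    comparisons.append((i, j, rng.choice(3, p=probs)))

def neg_log_lik(params):
    # params: per-model log-strengths plus a log tie parameter (Davidson's nu).
    p, nu = np.exp(params[:n_models]), np.exp(params[n_models])
    ll = 0.0
    for i, j, out in comparisons:
        denom = p[i] + p[j] + nu * np.sqrt(p[i] * p[j])
        outcome_probs = (p[i] / denom, p[j] / denom,
                         nu * np.sqrt(p[i] * p[j]) / denom)
        ll += np.log(outcome_probs[out])
    return -ll

res = minimize(neg_log_lik, x0=np.zeros(n_models + 1), method="L-BFGS-B")
strengths = np.exp(res.x[:n_models])
print("estimated strengths (relative to model 0):", strengths / strengths[0])
```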
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing ...
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: Investigated RLHF's fundamentals, focusing on the role of reward models, implications of design choices in RLHF training algorithms, and underlying issues like generalization errors, model misspecification, and feedback sparsity.
Citations: 117
-
Authors: LM Schulze Buschoff, E Akata, M Bethge
Year: 2025
Published in: Nature Machine ...
Institution: Max Planck Institute
Research Area: Visual Cognition, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs)
Discipline: Cognitive Science, Artificial Intelligence, Computer Vision
Vision-based large language models show proficiency in interpreting visual data but fall short of human-like abilities in causal reasoning, intuitive physics, and social cognition.
Methods: Controlled experiments evaluating model performance on tasks related to intuitive physics, causal reasoning, and intuitive psychology using visual processing benchmarks.
Key Findings: Model capabilities in understanding physical interactions, causal relationships, and social preferences.
DOI: https://doi.org/10.1038/s42256-024-00963-y
Citations: 70
-
Authors: F Salvi, M Horta Ribeiro, R Gallotti, R West
Year: 2025
Published in: Nature Human Behaviour
Institution: EPFL, Fondazione Bruno Kessler, Princeton University
Research Area: Conversational Persuasion with LLMs, Human-Computer Interaction (HCI), Behavioral Science, LLM
Discipline: Behavioral Science
GPT-4 can use personalized arguments to be more persuasive in debates, outperforming humans in 64.4% of AI-human comparisons when personalization is applied.
Methods: Preregistered controlled study involving multiround debates with random assignment to conditions focusing on AI-human comparisons, personalization, and opinion strength.
Key Findings: Effectiveness of persuasion by GPT-4, especially when using personalized arguments, compared to humans in debates.
Citations: 65
Sample Size: 900
-
Authors: SSY Kim, JW Vaughan, QV Liao, T Lombrozo
Year: 2025
Published in: Proceedings of the ...
Institution: Wake Forest University, University of Illinois at Urbana-Champaign, Princeton University, University of California Berkeley
Research Area: Appropriate Reliance on LLMs, Explainable AI, Human-AI Interaction, Cognitive Psychology
Discipline: Cognitive Psychology, Artificial Intelligence, Human-Computer Interaction (HCI)
The study examines factors that influence users' reliance on LLM responses, finding that explanations increase reliance, while sources and inconsistencies within explanations reduce reliance on incorrect responses.
Methods: Think-aloud study followed by a pre-registered, controlled experiment to assess the impact of explanations, sources, and inconsistencies in LLM responses on user reliance.
Key Findings: Users' reliance on LLM responses, accuracy, and the influence of explanations, inconsistencies, and sources on these measures.
DOI: https://doi.org/10.1145/3706598.3714020
Citations: 38
Sample Size: 308
-
Authors: H Bai, JG Voelkel, S Muldowney, JC Eichstaedt
Year: 2025
Published in: Nature ...
Institution: Stanford University
Research Area: Political Persuasion, LLM
Discipline: Computational Social Science
LLM-generated messages can effectively persuade humans on policy issues similarly to human-crafted messages, with differences in perceived persuasion mechanisms.
Methods: Three pre-registered experiments were conducted comparing the persuasive effectiveness of LLM-generated and human-generated messages on policy attitudes, using control conditions with neutral messages.
Key Findings: Influence of LLM-generated messages on participants' policy attitudes and perceived characteristics of the message authors.
Citations: 37
Sample Size: 4829
-
Authors: K Hackenburg, L Ibrahim, BM Tappin, M Tsakiris
Year: 2025
Published in: AI & SOCIETY
Institution: Oxford Internet Institute, University of Oxford
Research Area: Political Communication and Persuasion, LLM
Discipline: Political Science, Artificial Intelligence
GPT-4's ability to generate persuasive messages rivaled human experts on polarized US political issues, suggesting AI tools may have significant implications for political campaigns and democracy.
Methods: Pre-registered experiment where GPT-4 generated partisan role-playing persuasive messages, which were compared to those from human persuasion experts.
Key Findings: Persuasive impact of GPT-4-generated messages versus human expert messages on U.S. political issues.
Citations: 35
Sample Size: 4955
-
Authors: T Zhang, A Koutsoumpis, JK Oostrom
Year: 2025
Published in: IEEE Transactions ...
Institution: Southeast University, Vrije Universiteit, Tilburg University
Research Area: LLM Personality Assessment, Human-AI Interaction, LLM
Discipline: Human-AI Interaction, Social Science, Humanities
LLMs like GPT-3.5 and GPT-4 can rival or outperform task-specific AI models in assessing personality traits from asynchronous video interviews, but show uneven performance, low reliability, and potential biases, warranting cautious use in high-stakes scenarios.
Methods: The study evaluated GPT-3.5 and GPT-4 performance in assessing personality traits and interview performance using simulated AVI responses, comparing them with ratings from task-specific AI and human annotators.
Key Findings: Validity, reliability, fairness, and rating patterns of LLMs (GPT-3.5 and GPT-4) in personality assessment from asynchronous video interviews.
Citations: 31
Sample Size: 685
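A quick way to convey what "validity" means in this setting is a convergent-validity check: correlating LLM trait scores with mean human-annotator scores for the same interviewees. A minimal sketch with synthetic ratings (not the study's data or rating scale):

```python
# Convergent validity sketch: correlate LLM-assigned trait scores with
# mean human-annotator scores. All values are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
n = 100
human_mean = rng.normal(3.5, 0.6, size=n)            # 1-5 trait ratings
llm_score = human_mean + rng.normal(0, 0.5, size=n)  # noisy LLM judgment

r, p = pearsonr(human_mean, llm_score)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```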
-
Authors: K Hackenburg, BM Tappin, P Röttger, SA Hale
Year: 2025
Published in: Proceedings of the ...
Institution: University of California Berkeley, University of Cambridge, University of Oxford, Max Planck Institute
Research Area: Political Persuasion, LLM
Discipline: Computational Social Science, Political Science
Scaling language model sizes leads to diminishing returns in generating persuasive political messages, with larger models providing minimal gains compared to smaller ones after controlling for task completion metrics like coherence and relevance.
Methods: Generated 720 political messages using 24 LLMs of varying sizes and tested their persuasiveness through a large-scale randomized survey experiment.
Key Findings: Persuasive capability of language models across different sizes in generating political messages.
Citations: 31
Sample Size: 25982
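Diminishing returns of this kind are typically diagnosed by regressing the estimated persuasion effect on log model size; a minimal sketch with made-up numbers (not the paper's estimates):

```python
# Seeing diminishing returns: regress persuasion effect against log
# parameter count. All numbers are synthetic, not the paper's estimates.
import numpy as np

params = np.array([1e8, 5e8, 1e9, 7e9, 7e10, 3e11])  # model sizes
effect = np.array([2.0, 3.0, 3.3, 3.8, 4.0, 4.05])   # pp attitude shift

slope, intercept = np.polyfit(np.log10(params), effect, 1)
print(f"effect = {intercept:.2f} + {slope:.2f} * log10(parameters)")
# A roughly log-linear effect means each 10x increase in size buys only a
# constant small increment, i.e., sharply diminishing returns on the raw
# parameter scale.
```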
-
Authors: JY Bo, S Wan, A Anderson
Year: 2025
Published in: Proceedings of the 2025 CHI Conference ...
Institution: University of Toronto
Research Area: Appropriate Reliance on LLMs, Human-Computer Interaction (HCI), AI-Assisted Decision Making
Discipline: Human-Computer Interaction (HCI)
The paper investigates how users calibrate their reliance on LLM outputs in AI-assisted decision-making, evaluating interventions intended to support appropriate reliance.
Citations: 25
-
Authors: F Sun, N Li, K Wang, L Goette
Year: 2025
Published in: arXiv preprint arXiv:2505.02151
Institution: HKU Business School
Research Area: LLM Overconfidence and Human Bias Amplification, Bias, LLM
Discipline: Artificial Intelligence, Behavioral Science
Large language models (LLMs) exhibit overconfidence that is especially severe where their stated certainty declines, and their input doubles overconfidence in human decision making despite improving accuracy, thereby amplifying human bias.
Methods: Algorithmically constructed reasoning problems with known ground truths were used to evaluate LLMs' confidence; comparisons were drawn with human performance using similar experimental protocols.
Key Findings: LLM confidence levels, correctness probabilities, comparison of bias between LLMs and humans, and effects of LLM input on human decision making.
Citations: 21
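The overconfidence the authors measure reduces to the gap between average stated confidence and realized accuracy; a minimal sketch with illustrative numbers:

```python
# Overconfidence as the gap between stated confidence and realized accuracy.
# Numbers below are illustrative, not the paper's data.
import numpy as np

confidence = np.array([0.95, 0.90, 0.85, 0.99, 0.80, 0.92])  # stated P(correct)
correct    = np.array([1,    0,    1,    1,    0,    0   ])  # 1 = answer right

accuracy = correct.mean()
mean_conf = confidence.mean()
print(f"accuracy={accuracy:.3f}, mean confidence={mean_conf:.3f}, "
      f"overconfidence gap={mean_conf - accuracy:+.3f}")
```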
-
Authors: T Mendel, N Singh, DM Mann, B Wiesenfeld
Year: 2025
Published in: Journal of Medical ...
Institution: The City University of New York, George Washington University, New York University
Research Area: LLMs in Digital Health, Health Queries, User Attitudes
Discipline: Digital Health
Laypeople primarily use search engines rather than large language models (LLMs) for health queries, perceiving LLMs as less useful but also less biased and more human-like, with no significant difference in trust or ease of use.
Methods: A screening survey followed by logistic regression analysis and a follow-up survey; comparisons were performed using ANOVA, Tukey post hoc tests, and paired-sample Wilcoxon tests.
Key Findings: Demographics and behaviors of LLM and search engine users for health queries, perceived usefulness, ease of use, trustworthiness, bias, and anthropomorphism.
Citations: 21
Sample Size: 2002
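Of the tests listed, the paired-sample Wilcoxon is the simplest to reproduce; a minimal sketch with synthetic ratings (the scale and values are assumptions, not the study's data):

```python
# Paired-sample Wilcoxon signed-rank test comparing the same respondents'
# ratings of two tools. Synthetic ratings, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
search_usefulness = rng.integers(4, 8, size=200)  # ratings 4..7
llm_usefulness    = rng.integers(3, 7, size=200)  # ratings 3..6

stat, p = wilcoxon(search_usefulness, llm_usefulness)
print(f"Wilcoxon statistic={stat}, p={p:.4g}")
```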
-
Authors: K Hackenburg, BM Tappin, L Hewitt, E Saunders
Year: 2025
Published in: Science
Institution: London School of Economics and Political Science, Stony Brook University
Research Area: Political Persuasion with Conversational AI, LLM, Factual Accuracy in AI Systems.
Discipline: Political Science, Computational Social Science
This Science paper shows that conversational AI chatbots can systematically influence political opinions at scale, and that techniques such as post-training and prompting make them far more persuasive, but that this increased persuasion is tied to reduced factual accuracy in what the AI says.
Citations: 12
-
Authors: Z Chen, J Kalla, Q Le, S Nakamura-Sakai
Year: 2025
Published in: arXiv preprint arXiv ...
Institution: Not available
Research Area: Artificial Intelligence and Social Science, Persuasion Studies, Political Persuasion, LLM Chatbots, Democratic Societies
Discipline: Artificial Intelligence, Social Science
The study evaluates the cost-effectiveness and persuasive risks of large language model (LLM) chatbots in political contexts, finding that while LLM chatbots are as persuasive as campaign ads conditional on exposure, their large-scale influence is currently limited by scalability and cost barriers.
Methods: Two survey experiments combined with real-world simulation exercises to measure the persuasiveness of LLM chatbots compared to traditional campaign tactics, focusing on both exposure and acceptance phases of persuasion.
Key Findings: Short- and long-term persuasive effects of LLMs, cost-effectiveness of LLM-based persuasion ($48-$74 per persuaded voter), and scalability compared to traditional campaign approaches.
Citations: 7
Sample Size: 10417
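The cost-per-persuaded-voter figure is a simple ratio of spend to the number persuaded; a worked sketch with illustrative numbers (not the paper's inputs):

```python
# Cost per persuaded voter = campaign cost / number persuaded.
# All inputs below are illustrative, not the paper's figures.
cost_total = 50_000.0   # dollars spent on chatbot conversations
reached = 20_000        # voters who completed a conversation
persuasion_rate = 0.04  # share persuaded, net of control group

persuaded = reached * persuasion_rate
print(f"${cost_total / persuaded:.0f} per persuaded voter")
```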
-
Authors: TS Behrend, RN Landers
Year: 2025
Published in: Journal of Business and Psychology
Institution: University of Nebraska-Lincoln, University of Minnesota
Research Area: LLM in Behavioral Science Research, AI-Assisted Research Methodology
Discipline: Behavioral Science, Psychology, Artificial Intelligence
The paper proposes a framework with five use cases for integrating large language models into survey and experimental research, introduces the Qualtrics-AI Link (QUAIL) tool, and highlights technical and ethical considerations for using LLMs effectively and validly.
Methods: The paper outlines a decision-making framework for five potential uses of LLMs in survey and experimental design, introduces software (QUAIL) for integrating LLM knowledge into Qualtrics, and details technical steps such as prompt engineering, model testing, and validity monitoring.
Key Findings: Applications, implementation strategies, and ethical considerations of large language models in psychological research material development.
DOI: https://doi.org/10.1007/s10869-025-10035-6
Citations: 6
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about biases, hegemonic language, and human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: Y Ba, MV Mancenido, EK Chiou, R Pan
Year: 2025
Published in: Behavior Research Methods
Institution: University of Delaware, National Taiwan University, University of British Columbia, Monash University
Research Area: Crowdsourcing, Data Quality, Spamming Behavior Detection, LLM Applications in Behavioral Research
Discipline: Computer Science, Artificial Intelligence, LLM
The paper introduces a systematic method to evaluate crowdsourced data quality and detect spam behaviors through variance decomposition, proposing a spammer index and credibility metrics to improve consistency and reliability in labeling tasks.
Methods: Variance decomposition, Markov chain models, and generalized random effects models were used to assess annotator consistency and credibility; metrics were applied to both simulated and real-world data from two crowdsourcing platforms.
Key Findings: Quality of crowdsourced data, spammer behaviors, annotators’ consistency, and credibility.
Citations: 2
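The intuition behind a spammer index can be shown with a much simpler statistic, agreement with the majority label; the paper's variance-decomposition approach is more principled. A sketch on synthetic annotations:

```python
# Intuition behind a spammer index: a careless annotator's labels carry
# little information about the item being labeled. Here each annotator is
# scored by agreement with the majority label. Data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n_items, n_annotators = 100, 8
true_labels = rng.integers(0, 2, size=n_items)

labels = np.empty((n_annotators, n_items), dtype=int)
for a in range(n_annotators):
    if a < 6:   # diligent: ~90% agreement with ground truth
        flip = rng.random(n_items) < 0.1
    else:       # spammer: answers at random
        flip = rng.random(n_items) < 0.5
    labels[a] = np.where(flip, 1 - true_labels, true_labels)

majority = (labels.mean(axis=0) > 0.5).astype(int)
agreement = (labels == majority).mean(axis=1)
for a, score in enumerate(agreement):
    print(f"annotator {a}: agreement with majority = {score:.2f}")
```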
-
Authors: A Karamolegkou, O Eberle, P Rust, C Kauf, A Søgaard
Year: 2025
Published in: arXiv
Institution: Aleph Alpha, Massachusetts Institute of Technology
Research Area: Adversarial Ambiguity, Language Model Evaluation, Artificial intelligence, Computation and Language, LLM, AI Evaluation, Red Teaming
Discipline: Natural Language Processing
The paper assesses language models' sensitivity to ambiguity using an adversarial dataset and finds that direct prompting poorly identifies ambiguity, while linear probes achieve high accuracy in decoding ambiguity from model representations.
Methods: An adversarial ambiguity dataset was introduced with various types of ambiguities and transformations; models were tested using direct prompts and linear probes trained on internal representations.
Key Findings: Language models' ability to detect ambiguity, including syntactic, lexical, and phonological types, as well as performance under adversarial variations.
Citations: 2
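A linear probe here is just a logistic regression trained on frozen model representations; below is a minimal sketch in which random vectors stand in for hidden states (extracting real representations would require a specific model and layer):

```python
# A linear probe: logistic regression on frozen model representations.
# Random vectors stand in for hidden states; in practice you would extract
# them from a chosen layer of the language model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, dim = 500, 256
ambiguous = rng.integers(0, 2, size=n)   # 1 = sentence is ambiguous
direction = rng.normal(size=dim)         # pretend signal direction
reps = rng.normal(size=(n, dim)) + np.outer(ambiguous, direction)

X_tr, X_te, y_tr, y_te = train_test_split(reps, ambiguous, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```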
-
Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares
Year: 2025
Published in: arXiv preprint arXiv …
Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University
Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, RLHF, LLM
Discipline: Computer Science, Artificial Intelligence, Machine Learning
The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.
Methods: The authors proposed WorldTest, a protocol separating reward-free interaction from scored tests in related environments, with evaluations done using AutumnBench—a dataset of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics.
Key Findings: Performance of model-learning agents and humans in acquiring world models for masked-frame prediction, planning, and understanding causal dynamics.
Citations: 1
Sample Size: 517
-
Authors: S Liu, Z Cai, H Wang, Z Ma, X Li
Year: 2025
Published in: arXiv preprint arXiv:2505.19134
Institution: Meta, Imperial College London
Research Area: Artificial Intelligence, Crowdsourcing, LLM
Discipline: Artificial Intelligence
The paper develops a principal-agent model to incentivize high-quality human annotations using golden questions and identifies criteria for these questions to effectively monitor annotators' performance.
Methods: The authors use a principal-agent model with maximum likelihood estimation (MLE) and hypothesis testing to design incentive-compatible systems for annotators; golden questions with high-certainty answers and a format similar to normal items were selected and validated through experiments.
Key Findings: The effectiveness of golden questions for incentivizing and monitoring high-quality human annotations in preference data.
DOI: https://doi.org/10.48550/arXiv.2505.19134
Citations: 1
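The monitoring half of such a scheme reduces to a hypothesis test on an annotator's accuracy over the embedded golden questions; a minimal sketch using a one-sided binomial test (the diligence threshold and significance level are illustrative, not the paper's):

```python
# Monitoring annotators with golden questions: test whether an annotator's
# accuracy on embedded gold items is consistent with diligent work.
# The 0.9 diligent-accuracy assumption and alpha = 0.05 are illustrative.
from scipy.stats import binomtest

n_golden, n_correct = 20, 14
# H0: annotator is diligent (accuracy >= 0.9); reject if too few correct.
result = binomtest(n_correct, n_golden, p=0.9, alternative="less")
print(f"p-value = {result.pvalue:.4f} ->",
      "flag annotator" if result.pvalue < 0.05 else "no evidence of spamming")
```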