Discover 15 peer-reviewed studies in RLHF (2023–2025). Explore research findings powered by Prolific's diverse participant panel.
This page lists 15 peer-reviewed papers in the research area of RLHF in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing ... (dl.acm.org)
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: RLHF's fundamentals, including the role of reward models, the implications of design choices in RLHF training algorithms, and underlying issues such as generalization error, model misspecification, and feedback sparsity (a minimal reward-model sketch follows this entry).
Citations: 117
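For readers unfamiliar with the reward models this survey scrutinizes, here is a minimal sketch (my illustration, not the paper's code) of the Bradley-Terry pairwise loss that RLHF reward models are commonly trained with; the response scores below are made-up values.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that a human prefers the chosen response,
    under the Bradley-Terry model commonly assumed in RLHF:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Illustrative scores from a hypothetical reward model: the loss is small
# when the model already ranks the human-preferred response higher.
print(round(bradley_terry_loss(1.3, 0.4), 2))  # 0.34
print(round(bradley_terry_loss(0.2, 1.1), 2))  # 1.24
```

Issues the survey raises, such as model misspecification, amount to this assumed preference model (and the reward scores it induces) failing to match how people actually choose.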
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information ... (Springer)
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: The alignment of AI systems with human values and the efficacy of RLHF techniques in achieving the HHH principle (helpfulness, harmlessness, honesty).
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576 (arxiv.org)
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about biases, hegemonic language, and human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: A Warrier, D Nguyen, M Naim, M Jain, Y Liang, K Schroeder, C Yang, JB Tenenbaum, S Vollmer, K Ellis, Z Tavares
Year: 2025
Published in: arXiv preprint arXiv ... (arxiv.org)
Institution: Basis Research Institute, DFKI GmbH, Harvard University, Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, Cornell University
Research Area: Agent learning, World Models, Benchmarking, Evaluation protocols, RLHF, LLM
Discipline: Computer Science, Artificial Intelligence, Machine Learning
The paper introduces WorldTest, a novel protocol for evaluating model-learning agents using reward-free exploration and behavior-based scoring, and demonstrates that humans outperform models on the AutumnBench suite of tasks, revealing significant gaps in world-model learning.
Methods: The authors proposed WorldTest, a protocol separating reward-free interaction from scored tests in related environments, with evaluations done using AutumnBench, a suite of 43 grid-world environments and 129 tasks across prediction, planning, and causal dynamics (a sketch of the protocol's two-phase structure follows this entry).
Key Findings: Performance of model-learning agents and humans in acquiring world models for masked-frame prediction, planning, and understanding causal dynamics.
Citations: 1
Sample Size: 517
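To make the protocol's shape concrete, here is a minimal sketch, assuming a hypothetical environment interface (reset, step, queries, and check are illustrative names, not AutumnBench's actual API), of the reward-free-interaction-then-scored-test structure that WorldTest describes.

```python
from typing import Any, Protocol

class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...
    def observe(self, observation: Any) -> None: ...

def worldtest_style_eval(agent: Agent, explore_env: Any, test_env: Any,
                         explore_steps: int) -> float:
    # Phase 1: reward-free interaction. The agent explores freely and
    # only ever receives observations; no reward signal exists.
    obs = explore_env.reset()
    for _ in range(explore_steps):
        obs = explore_env.step(agent.act(obs))
        agent.observe(obs)
    # Phase 2: a scored test in a related environment. The score is
    # derived from behavior (e.g., correct answers to queries), so
    # humans and models can be compared on the same footing.
    results = [test_env.check(q, agent.act(q)) for q in test_env.queries()]
    return sum(results) / max(len(results), 1)
```

Scoring behavior rather than accumulated reward is what lets the benchmark compare the 517 human participants directly against model-based agents.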
-
Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint arXiv ... (arxiv.org)
Institution: Stanford University, University of Massachusetts Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning from Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper investigates whether human preferences can be influenced to align more closely with assumed preference models in RLHF algorithms through interventions such as showing model-derived quantities, training on preference models, and modifying elicitation questions.
Methods: Three human studies were conducted where interventions were tested, including revealing model-derived quantities, training participants on a preference model, and altering how preference questions were framed.
Key Findings: The impact of the interventions on how closely humans' expressed preferences conform to the preference models that RLHF algorithms assume (an illustrative sketch follows this entry).
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
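As background on what conforming to an "assumed preference model" means, the sketch below (my illustration, not the authors' code) scores how much probability one commonly assumed model, a Boltzmann-rational choice over partial returns, assigns to the labels annotators actually gave; the interventions studied in the paper aim to raise this agreement. All numbers are made up.

```python
import math

def assumed_pref_prob(ret_a: float, ret_b: float, beta: float = 1.0) -> float:
    # Boltzmann-rational partial-return model, a standard assumption in
    # RLHF: preference probability follows the gap in segment returns.
    return 1.0 / (1.0 + math.exp(-beta * (ret_a - ret_b)))

def mean_model_likelihood(labels, rets_a, rets_b) -> float:
    # Average probability the assumed model assigns to the choices
    # humans actually made; 1.0 would mean perfect conformance.
    probs = [assumed_pref_prob(a, b) if lab == "A" else 1.0 - assumed_pref_prob(a, b)
             for lab, a, b in zip(labels, rets_a, rets_b)]
    return sum(probs) / len(probs)

print(round(mean_model_likelihood(["A", "B", "A"],
                                  [2.0, 0.5, 1.0],
                                  [1.0, 1.5, 1.2]), 2))  # 0.64
```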
-
Authors: Z Cheng, J You
Year: 2025
Published in: arXiv preprint arXiv:2509.22989 (arxiv.org)
Institution: University of Southern California, University of California Berkeley
Research Area: Artificial Intelligence, Computers and Society, Computer Science and Game Theory, Strategic Persuasion, Reinforcement Learning, Language Models, LLM, RLHF
Discipline: Artificial Intelligence
This paper introduces a scalable framework, utilizing Bayesian Persuasion, to evaluate and train LLMs for strategic persuasion, demonstrating significant persuasion gains and effective strategies through reinforcement learning.
Methods: Repurposed human-human persuasion datasets for evaluation and training; applied the Bayesian Persuasion framework; used reinforcement learning to optimize LLMs for strategic persuasion (a toy example of the Bayesian Persuasion framework follows this entry).
Key Findings: The persuasive capabilities and strategies of large language models (LLMs) in various settings.
Citations: 1
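The paper's framing builds on the Bayesian Persuasion model from game theory; the toy computation below reproduces the canonical Kamenica-Gentzkow prosecutor example of that framework (it is background for the framework, not the paper's setup).

```python
from fractions import Fraction

# Prior probability the defendant is guilty; the judge convicts only
# when the posterior probability of guilt reaches 1/2.
prior_guilty = Fraction(1, 3)

# The prosecutor commits in advance to a signaling scheme: report
# "guilty" for every guilty defendant and for half of the innocent ones.
x = Fraction(1, 2)

p_guilty_signal = prior_guilty + (1 - prior_guilty) * x  # P(signal = "guilty")
posterior = prior_guilty / p_guilty_signal               # P(guilty | "guilty")

print(posterior)        # 1/2 -> just enough for the judge to convict
print(p_guilty_signal)  # 2/3 -> conviction rate rises from 1/3 to 2/3
```

Evaluating whether an LLM can discover commitment strategies like this, and training it with reinforcement learning to do so, is the core of the paper's persuasion-gain results.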
-
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint arXiv ... (arxiv.org)
Institution: University of California, Davis; Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how human beliefs about agent capabilities influence preferences in RLHF, proposing a model to minimize the mismatch between beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Effects of human beliefs about agent capabilities on their provided preferences and the performance of RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
-
Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier
Year: 2024
Published in: epub.ub.uni-muenchen.de
Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, Reward Modeling
Discipline: Artificial Intelligence
This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.
Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.
Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.
Citations: 354
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: ... on Cognitive and ... (ieeexplore.ieee.org)
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: GKM Liu
Year: 2024
Published in: Massachusetts Institute of Technology, 2023 (computing.mit.edu)
Institution: Massachusetts Institute of Technology
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction
Discipline: Artificial Intelligence
The paper explores Reinforcement Learning from Human Feedback (RLHF) as a transformative tool to align AI with human values, mitigate bias, and democratize technology, while emphasizing its societal implications and ethical considerations.
Methods: The paper employs a systematic study of existing and potential societal effects of RLHF, guided by key questions addressing ethical, social, and practical impacts.
Key Findings: The study investigates how RLHF affects information integrity, societal values, social equity, access to AI, cultural relations, industrial transformation, and labor dynamics.
Citations: 17
-
Authors: J Kompatscher
Year: 2024
Published in: aaltodoc.aalto.fi
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-Computer Interaction (HCI), Machine Learning (ML)
Discipline: Computer Science
DOI: https://urn.fi/URN:NBN:fi:aalto-202501271897
Citations: 1
-
Authors: C Ravulu, R Sarabu, M Suryadevara
Year: 2024
Published in: ... Conference on AI x ... (ieeexplore.ieee.org)
Institution: International Institute of Information Technology, University of California Santa Cruz, University of South Carolina Aiken
Research Area: Reinforcement Learning from Human Feedback (RLHF), Bias Mitigation, LLM, AI Bias
Discipline: Artificial Intelligence
URL: https://ieeexplore.ieee.org/abstract/document/10990073/
-
Authors: K Zhou, JD Hwang, X Ren, M Sap
Year: 2024
Published in: arXiv
Institution: Allen Institute for AI, Carnegie Mellon University, Stanford University, University of Southern California
Research Area: LLM Reliability and Uncertainty Quantification, Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
-
Authors: S Casper, X Davies, C Shi, TK Gilbert
Year: 2023
Published in: arXiv preprint arXiv ... (arxiv.org)
Institution: Columbia University, Cornell Tech, Apollo Research, ETH Zurich, UC Berkeley, University of Sussex, Independent
Research Area: Reinforcement Learning from Human Feedback (RLHF), Alignment, LLM Limitations
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2307.15217
Citations: 848
-
Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang
Year: 2023
Published in: arXiv preprint arXiv ... (arxiv.org)
Institution: Cornell University, Georgia Institute of Technology
Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning
Discipline: Artificial Intelligence, Machine Learning
DOI: https://doi.org/10.48550/arXiv.2310.12773
Citations: 598