Discover 11 peer-reviewed studies in Reinforcement Learning from Human Feedback (RLHF) (2023–2025). Explore research findings powered by Prolific's diverse participant panel.
This page lists 11 peer-reviewed papers in the research area of Reinforcement Learning from Human Feedback (RLHF) in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing ..., 2025 - dl.acm.org
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, RLHF
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: The central role of reward models in RLHF, the implications of design choices in RLHF training algorithms, and underlying issues such as generalization error, model misspecification, and feedback sparsity.
Citations: 117
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information ..., 2025 - Springer
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: RLHF and RLAIF face theoretical and practical limitations in achieving the HHH principle (helpfulness, harmlessness, honesty), motivating a broader sociotechnical approach to alignment.
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576, 2025 - arxiv.org
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes the ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about bias, hegemonic language norms, and effects on human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: University of California, Davis, Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how human beliefs about agent capabilities influence preferences in RLHF, proposing a model to minimize the mismatch between beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Human beliefs about agent capabilities shape the preferences they provide, and accounting for this belief mismatch improves the performance of RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
-
Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier
Year: 2024
Published in: 2024 - epub.ub.uni-muenchen.de
Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, Reward Modeling
Discipline: Artificial Intelligence
This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.
Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.
Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.
Citations: 354
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: J Kompatscher
Year: 2024
Published in: 2024 - aaltodoc.aalto.fi
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-Computer Interaction (HCI), Machine Learning (ML)
Discipline: Computer Science
URN: https://urn.fi/URN:NBN:fi:aalto-202501271897
Citations: 1
-
Authors: C Ravulu, R Sarabu, M Suryadevara
Year: 2024
Published in: ... Conference on AI x ..., 2024 - ieeexplore.ieee.org
Institution: International Institute of Information Technology, University of California Santa Cruz, University of South Carolina Aiken
Research Area: Reinforcement Learning from Human Feedback (RLHF), Bias Mitigation, LLM, AI Bias
Discipline: Artificial Intelligence
URL: https://ieeexplore.ieee.org/abstract/document/10990073/
-
Authors: K Zhou, JD Hwang, X Ren, M Sap
Year: 2024
Published in: ArXiv
Institution: Allen Institute for AI, Carnegie Mellon University, Stanford University, University of Southern California
Research Area: LLM Reliability and Uncertainty Quantification, Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
-
Authors: S Casper, X Davies, C Shi, TK Gilbert
Year: 2023
Published in: arXiv preprint arXiv ..., 2023 - arxiv.org
Institution: Columbia University, Cornell Tech, Apollo Research, ETH Zurich, UC Berkeley, University of Sussex, Independent
Research Area: Reinforcement Learning from Human Feedback (RLHF), Alignment, LLM Limitations
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2307.15217
Citations: 848
-
Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang
Year: 2023
Published in: arXiv preprint arXiv ..., 2023 - arxiv.org
Institution: Cornell University, Georgia Institute of Technology
Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning
Discipline: Artificial Intelligence, Machine Learning
DOI: https://doi.org/10.48550/arXiv.2310.12773
Citations: 598