Discover 11 peer-reviewed studies in Reinforcement Learning from Human Feedback (RLHF) (2023–2025). Explore research findings powered by Prolific's diverse participant panel.
This page lists 11 peer-reviewed papers in the research area of Reinforcement Learning from Human Feedback (RLHF) in the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific.
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing ..., 2025 - dl.acm.org
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, RLHF
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: The central role of reward models in RLHF, the implications of design choices in RLHF training algorithms, and underlying issues such as generalization error, model misspecification, and feedback sparsity.
Citations: 117
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information ..., 2025 - Springer
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: RLHF and RLAIF face theoretical and practical limitations in achieving the HHH principle (helpfulness, harmlessness, honesty), motivating a broader sociotechnical approach to alignment.
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576, 2025 - arxiv.org
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes the ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about bias, hegemonic language norms, and effects on human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: University of California, Davis, Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how human beliefs about agent capabilities influence preferences in RLHF, proposing a model to minimize the mismatch between beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Human beliefs about agent capabilities shape the preferences they provide, and accounting for this belief mismatch improves the performance of RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
-
Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier
Year: 2024
Published in: 2024 - epub.ub.uni-muenchen.de
Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, Reward Modeling
Discipline: Artificial Intelligence
This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.
Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.
Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.
Citations: 354
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: J Kompatscher
Year: 2024
Published in: 2024 - aaltodoc.aalto.fi
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-Computer Interaction (HCI), Machine Learning (ML)
Discipline: Computer Science
URN: https://urn.fi/URN:NBN:fi:aalto-202501271897
Citations: 1
-
Authors: C Ravulu, R Sarabu, M Suryadevara
Year: 2024
Published in: ... Conference on AI x ..., 2024 - ieeexplore.ieee.org
Institution: International Institute of Information Technology, University of California Santa Cruz, University of South Carolina Aiken
Research Area: Reinforcement Learning from Human Feedback (RLHF), Bias Mitigation, LLM, AI Bias
Discipline: Artificial Intelligence
URL: https://ieeexplore.ieee.org/abstract/document/10990073/
-
Authors: K Zhou, JD Hwang, X Ren, M Sap
Year: 2024
Published in: ArXiv
Institution: Allen Institute for AI, Carnegie Mellon University, Stanford University, University of Southern California
Research Area: LLM Reliability and Uncertainty Quantification, Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
-
Authors: S Casper, X Davies, C Shi, TK Gilbert
Year: 2023
Published in: arXiv preprint arXiv ..., 2023 - arxiv.org
Institution: Columbia University, Cornell Tech, Apollo Research, ETH Zurich, UC Berkeley, University of Sussex, Independent
Research Area: Reinforcement Learning from Human Feedback (RLHF), Alignment, LLM Limitations
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2307.15217
Citations: 848
-
Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang
Year: 2023
Published in: arXiv preprint arXiv ..., 2023 - arxiv.org
Institution: Cornell University, Georgia Institute of Technology
Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning
Discipline: Artificial Intelligence, Machine Learning
DOI: https://doi.org/10.48550/arXiv.2310.12773
Citations: 598