This page lists 17 peer-reviewed studies (2023–2025) in the research area of Human Feedback from the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific's diverse participant panel.
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing Surveys (dl.acm.org)
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLMs
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: An examination of RLHF's fundamentals, covering the central role of reward models, the implications of design choices in RLHF training algorithms, and underlying issues such as generalization error, model misspecification, and feedback sparsity.
Citations: 117
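As background for the reward-model discussion above, here is a minimal sketch of the pairwise Bradley-Terry loss that standard RLHF pipelines use to fit a reward model to human comparisons. The function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def reward_model_pairwise_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry loss for reward modeling: -log(sigmoid(r_chosen - r_rejected)),
    averaged over a batch of human preference pairs. Lower loss means the model
    assigns higher reward to the responses humans preferred."""
    margins = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margins))))  # numerically stable -log(sigmoid)

# Illustrative reward scores for three preference pairs (chosen vs. rejected responses):
chosen = np.array([1.2, 0.4, 2.0])
rejected = np.array([0.3, 0.9, 0.5])
print(reward_model_pairwise_loss(chosen, rejected))  # ~0.51
```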
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information Technology (Springer)
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: RLHF and RLAIF face theoretical and practical limitations in achieving the HHH principle (helpfulness, harmlessness, honesty), suggesting that aligning AI systems with human values cannot rest on these techniques alone.
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about biases, hegemonic language, and human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint arXiv:2501.06416
Institution: Stanford University, University of Massachusetts Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning with Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper investigates whether human preferences can be influenced to align more closely with assumed preference models in RLHF algorithms through interventions such as showing model-derived quantities, training on preference models, and modifying elicitation questions.
Methods: Three human studies tested interventions including revealing model-derived quantities, training participants on a preference model, and altering how preference questions were framed.
Key Findings: The effect of each intervention on how humans express preferences, and the extent to which expressed preferences can be brought closer to the preference models assumed by RLHF algorithms.
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
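To make the "assumed preference model" concrete, the sketch below shows the Boltzmann-rational (Bradley-Terry) choice model that RLHF algorithms typically assume when interpreting pairwise preferences. The parameter names are illustrative and not drawn from the paper itself.

```python
import math

def preference_probability(return_a: float, return_b: float, beta: float = 1.0) -> float:
    """Boltzmann-rational (Bradley-Terry) model typically assumed by RLHF:
    the probability a rater prefers option A over option B is a logistic
    function of the difference in their returns, with rationality parameter beta."""
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))

# A rater is assumed to prefer the higher-return option with probability ~0.73
# when returns differ by 1.0 and beta = 1.0; beta -> 0 approaches random choice.
print(preference_probability(2.0, 1.0))
```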
-
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint arXiv:2506.01692
Institution: University of California Davis, Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how human beliefs about agent capabilities influence preferences in RLHF, proposing a model to minimize the mismatch between beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Effects of human beliefs about agent capabilities on their provided preferences and the performance of RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
-
Authors: Mohammed Almutairi, Charles Chiang, Yuxin Bai, Diego Gomez-Zara
Year: 2025
Published in: arXiv preprint (arxiv.org)
Institution: University of Notre Dame
Research Area: Human-AI Interaction, Team Effectiveness, Automated Feedback, LLMs
Discipline: Human-Computer Interaction (HCI)
The paper presents tAIfa, an LLM-based AI agent that enhances team communication and cohesion by delivering automated feedback grounded in analysis of team interactions.
Methods: Between-subjects study where team interactions were analyzed by an AI agent (tAIfa) to deliver feedback on strengths and areas for improvement.
Key Findings: Differences in team communication, member contributions, and cohesion between teams that received tAIfa's feedback and those that did not.
Sample Size: 18
-
Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier
Year: 2024
Published in: epub.ub.uni-muenchen.de (LMU Munich)
Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, Reward Modeling
Discipline: Artificial Intelligence
This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.
Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.
Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.
Citations: 354
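For readers new to the area, the standard RLHF fine-tuning objective that surveys like this one build on can be written as follows (the canonical formulation, not notation specific to this paper):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
```

where $r_\phi$ is a reward model fit to human preference data, $\pi_{\mathrm{ref}}$ is the pre-trained reference policy, and $\beta$ controls how far the fine-tuned policy $\pi_\theta$ may drift from it.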
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: IEEE Transactions on Cognitive and Developmental Systems (ieeexplore.ieee.org)
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: HR Kirk, M Bartolo, A Whitefield, P Rottger
Year: 2024
Published in: Advances in Neural Information Processing Systems (NeurIPS 2024)
Institution: Meta, Cohere, AWS AI Labs, Contextual AI, Factored AI, University of Oxford, Bocconi University, Meedan, Hugging Face, University College London, ML Commons, University of Pennsylvania
Research Area: LLM Alignment, Human Feedback, Multicultural Studies
Discipline: Artificial Intelligence, Computational Social Science
The PRISM Alignment Dataset presents a large-scale, culturally diverse human feedback dataset linking sociodemographic profiles of 1,500 participants from 75 countries to their contextual preferences and fine-grained ratings in 8,011 live conversations with 21 LLMs. This enables analysis of how subjective values vary across people and cultures in LLM alignment data.
DOI: https://doi.org/10.52202/079017-3342
Citations: 204
-
Authors: GKM Liu
Year: 2023
Published in: Massachusetts Institute of Technology (computing.mit.edu)
Institution: Massachusetts Institute of Technology
Research Area: Reinforcement Learning with Human Feedback (RLHF), Human-AI Interaction
Discipline: Artificial Intelligence
The paper explores Reinforcement Learning with Human Feedback (RLHF) as a transformative tool to align AI with human values, mitigate bias, and democratize technology, while emphasizing its societal implications and ethical considerations.
Methods: The paper employs a systematic study of existing and potential societal effects of RLHF, guided by key questions addressing ethical, social, and practical impacts.
Key Findings: The study investigates how RLHF affects information integrity, societal values, social equity, access to AI, cultural relations, industrial transformation, and labor dynamics.
Citations: 17
-
Authors: J Kompatscher
Year: 2024
Published in: Aalto University (aaltodoc.aalto.fi)
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-Computer Interaction (HCI), Machine Learning (ML)
Discipline: Computer Science
URN: https://urn.fi/URN:NBN:fi:aalto-202501271897
Citations: 1
-
Authors: C Ravulu, R Sarabu, M Suryadevara
Year: 2024
Published in: ... Conference on AI x ... (ieeexplore.ieee.org)
Institution: International Institute of Information Technology, University of California Santa Cruz, University of South Carolina Aiken
Research Area: Reinforcement Learning from Human Feedback (RLHF), Bias Mitigation, LLM, AI Bias
Discipline: Artificial Intelligence
URL: https://ieeexplore.ieee.org/abstract/document/10990073/
-
Authors: K Zhou, JD Hwang, X Ren, M Sap
Year: 2024
Published in: arXiv preprint (arxiv.org)
Institution: Allen Institute for AI, Carnegie Mellon University, Stanford University, University of Southern California
Research Area: LLM Reliability and Uncertainty Quantification, Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
-
Authors: S Casper, X Davies, C Shi, TK Gilbert
Year: 2023
Published in: arXiv preprint arXiv:2307.15217
Institution: Columbia University, Cornell Tech, Apollo Research, ETH Zurich, UC Berkeley, University of Sussex, Independent
Research Area: Reinforcement Learning from Human Feedback (RLHF), Alignment, LLM Limitations
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2307.15217
Citations: 848
-
Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang
Year: 2023
Published in: arXiv preprint arXiv:2310.12773
Institution: Peking University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning
Discipline: Artificial Intelligence, Machine Learning
DOI: https://doi.org/10.48550/arXiv.2310.12773
Citations: 598
-
Authors: M Glickman, T Sharot
Year: 2025
Published in: Nature Human Behaviour (nature.com)
Institution: Max Planck University College London Centre, University College London, Affective Brain Lab
Research Area: Human-AI Feedback Loops, Perceptual and Emotional Judgement, Social Psychology
Discipline: Social Science, Psychology
Citations: 180
-
Authors: O Henkel, L Hills
Year: 2023
Published in: Proceedings of the Tenth ACM Conference on Learning @ Scale (L@S 2023, dl.acm.org)
Institution: University of Cambridge, University of Bath
Research Area: Crowdsourcing, Comparative Judgement, Educational Datasets, Human Feedback
Discipline: Computer Science
DOI: https://doi.org/10.1145/3573051.3596198
Citations: 2