This page lists 17 peer-reviewed studies (2023–2025) in the research area of Human Feedback from the Prolific Citations Library, a curated collection of research powered by high-quality human data from Prolific's diverse participant panel.
-
Authors: S Chaudhari, P Aggarwal, V Murahari
Year: 2025
Published in: ACM Computing Surveys (dl.acm.org)
Institution: University of Massachusetts Amherst, Carnegie Mellon University, Princeton University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLMs
Discipline: Artificial Intelligence
The paper critically analyzes reinforcement learning from human feedback (RLHF) for large language models (LLMs), emphasizing the importance and limitations of reward models in improving human-aligned AI systems.
Methods: Analyzed RLHF frameworks through reinforcement learning principles; conducted a categorical literature review to identify modeling challenges, assumptions, and framework limitations.
Key Findings: An examination of RLHF's fundamentals, covering the central role of reward models, the implications of design choices in RLHF training algorithms, and underlying issues such as generalization error, model misspecification, and feedback sparsity.
Citations: 117
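As background for the reward-model discussion above, here is a minimal sketch of the pairwise Bradley-Terry loss that standard RLHF pipelines use to fit a reward model to human comparisons. The function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def reward_model_pairwise_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry loss for reward modeling: -log(sigmoid(r_chosen - r_rejected)),
    averaged over a batch of human preference pairs. Lower loss means the model
    assigns higher reward to the responses humans preferred."""
    margins = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-margins))))  # numerically stable -log(sigmoid)

# Illustrative reward scores for three preference pairs (chosen vs. rejected responses):
chosen = np.array([1.2, 0.4, 2.0])
rejected = np.array([0.3, 0.9, 0.5])
print(reward_model_pairwise_loss(chosen, rejected))  # ~0.51
```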
-
Authors: A Dahlgren Lindström, L Methnani, L Krause
Year: 2025
Published in: Ethics and Information Technology (Springer)
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
The paper critiques AI alignment efforts using RLHF and RLAIF, highlighting theoretical and practical limitations in meeting the goals of helpfulness, harmlessness, and honesty, and advocates for a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: RLHF and RLAIF face theoretical and practical limitations in achieving the HHH principle (helpfulness, harmlessness, honesty), suggesting that aligning AI systems with human values cannot rest on these techniques alone.
DOI: https://doi.org/10.1007/s10676-025-09837-2
Citations: 14
-
Authors: S Lodoen, A Orchard
Year: 2025
Published in: arXiv preprint arXiv:2505.09576
Institution: Embry-Riddle Aeronautical University, University of Waterloo
Research Area: Reinforcement Learning from Human Feedback (RLHF), Procedural Rhetoric, LLM Persuasion, Ethics
Discipline: Artificial Intelligence, AI Ethics, Social Science
The paper uses procedural rhetoric to analyze how RLHF reshapes ethical, social, and rhetorical dimensions of generative AI interactions, raising concerns about biases, hegemonic language, and human relationships.
Methods: The study conducts a theoretical and rhetorical analysis based on Ian Bogost's concept of procedural rhetoric, examining how RLHF mechanisms influence language conventions, information practices, and social expectations.
Key Findings: Ethical and rhetorical implications of RLHF-enhanced LLMs on language usage, information seeking, and interpersonal dynamics.
DOI: https://doi.org/10.48550/arXiv.2505.09576
Citations: 3
-
Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint arXiv:2501.06416
Institution: Stanford University, University of Massachusetts Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning with Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper investigates whether human preferences can be influenced to align more closely with assumed preference models in RLHF algorithms through interventions such as showing model-derived quantities, training on preference models, and modifying elicitation questions.
Methods: Three human studies tested interventions including revealing model-derived quantities, training participants on a preference model, and altering how preference questions were framed.
Key Findings: The effect of each intervention on how humans express preferences, and the extent to which expressed preferences can be brought closer to the preference models assumed by RLHF algorithms.
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
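To make the "assumed preference model" concrete, the sketch below shows the Boltzmann-rational (Bradley-Terry) choice model that RLHF algorithms typically assume when interpreting pairwise preferences. The parameter names are illustrative and not drawn from the paper itself.

```python
import math

def preference_probability(return_a: float, return_b: float, beta: float = 1.0) -> float:
    """Boltzmann-rational (Bradley-Terry) model typically assumed by RLHF:
    the probability a rater prefers option A over option B is a logistic
    function of the difference in their returns, with rationality parameter beta."""
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))

# A rater is assumed to prefer the higher-return option with probability ~0.73
# when returns differ by 1.0 and beta = 1.0; beta -> 0 approaches random choice.
print(preference_probability(2.0, 1.0))
```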
-
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint arXiv:2506.01692
Institution: University of California Davis, Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how human beliefs about agent capabilities influence preferences in RLHF, proposing a model to minimize the mismatch between beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Effects of human beliefs about agent capabilities on their provided preferences and the performance of RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
-
Authors: Mohammed Almutairi, Charles Chiang, Yuxin Bai, Diego Gomez-Zara
Year: 2025
Published in: arXiv preprint (arxiv.org)
Institution: University of Notre Dame
Research Area: Human-AI Interaction, Team Effectiveness, Automated Feedback, LLMs
Discipline: Human-Computer Interaction (HCI)
The paper presents tAIfa, an LLM-based AI agent that enhances team communication and cohesion by delivering automated feedback grounded in analysis of team interactions.
Methods: Between-subjects study where team interactions were analyzed by an AI agent (tAIfa) to deliver feedback on strengths and areas for improvement.
Key Findings: Differences in team communication, member contributions, and cohesion between teams that received tAIfa's feedback and those that did not.
Sample Size: 18
-
Authors: T Kaufmann, P Weng, V Bengs, E Hüllermeier
Year: 2024
Published in: epub.ub.uni-muenchen.de (LMU Munich)
Institution: Paderborn University, German Research Center for Artificial Intelligence (DFKI), Duke Kunshan University
Research Area: Reinforcement Learning from Human Feedback (RLHF), LLM, Reward Modeling
Discipline: Artificial Intelligence
This paper surveys the fundamentals, diverse applications, and evolving impact of reinforcement learning from human feedback (RLHF), emphasizing its role in improving intelligent system alignment and performance.
Methods: The paper utilizes a survey-based approach to synthesize existing research, exploring the interactions between reinforcement learning algorithms and human input.
Key Findings: The study examines the principles, dynamics, applications, and trends in RLHF, offering insights into its role in enhancing large language models (LLMs) and intelligent systems.
Citations: 354
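For readers new to the area, the standard RLHF fine-tuning objective that surveys like this one build on can be written as follows (the canonical formulation, not notation specific to this paper):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
```

where $r_\phi$ is a reward model fit to human preference data, $\pi_{\mathrm{ref}}$ is the pre-trained reference policy, and $\beta$ controls how far the fine-tuned policy $\pi_\theta$ may drift from it.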
-
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Year: 2024
Published in: IEEE Transactions on Cognitive and Developmental Systems (ieeexplore.ieee.org)
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.
Key Findings: The ability of LLMs to resist or adopt extreme ideological viewpoints under semantic conditioning.
Citations: 219
-
Authors: HR Kirk, M Bartolo, A Whitefield, P Rottger
Year: 2024
Published in: Advances in Neural Information Processing Systems (NeurIPS 2024)
Institution: Meta, Cohere, AWS AI Labs, Contextual AI, Factored AI, University of Oxford, Bocconi University, Meedan, Hugging Face, University College London, ML Commons, University of Pennsylvania
Research Area: LLM Alignment, Human Feedback, Multicultural Studies
Discipline: Artificial Intelligence, Computational Social Science
The PRISM Alignment Dataset presents a large-scale, culturally diverse human feedback dataset linking sociodemographic profiles of 1,500 participants from 75 countries to their contextual preferences and fine-grained ratings in 8,011 live conversations with 21 LLMs. This enables analysis of how subjective values vary across people and cultures in LLM alignment data.
DOI: https://doi.org/10.52202/079017-3342
Citations: 204
-
Authors: GKM Liu
Year: 2023
Published in: Massachusetts Institute of Technology (computing.mit.edu)
Institution: Massachusetts Institute of Technology
Research Area: Reinforcement Learning with Human Feedback (RLHF), Human-AI Interaction
Discipline: Artificial Intelligence
The paper explores Reinforcement Learning with Human Feedback (RLHF) as a transformative tool to align AI with human values, mitigate bias, and democratize technology, while emphasizing its societal implications and ethical considerations.
Methods: The paper employs a systematic study of existing and potential societal effects of RLHF, guided by key questions addressing ethical, social, and practical impacts.
Key Findings: The study investigates how RLHF affects information integrity, societal values, social equity, access to AI, cultural relations, industrial transformation, and labor dynamics.
Citations: 17
-
Authors: J Kompatscher
Year: 2024
Published in: Aalto University (aaltodoc.aalto.fi)
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-Computer Interaction (HCI), Machine Learning (ML)
Discipline: Computer Science
URN: https://urn.fi/URN:NBN:fi:aalto-202501271897
Citations: 1
-
Authors: C Ravulu, R Sarabu, M Suryadevara
Year: 2024
Published in: ... Conference on AI x ... (ieeexplore.ieee.org)
Institution: International Institute of Information Technology, University of California Santa Cruz, University of South Carolina Aiken
Research Area: Reinforcement Learning from Human Feedback (RLHF), Bias Mitigation, LLM, AI Bias
Discipline: Artificial Intelligence
URL: https://ieeexplore.ieee.org/abstract/document/10990073/
-
Authors: K Zhou, JD Hwang, X Ren, M Sap
Year: 2024
Published in: arXiv preprint (arxiv.org)
Institution: Allen Institute for AI, Carnegie Mellon University, Stanford University, University of Southern California
Research Area: LLM Reliability and Uncertainty Quantification, Reinforcement Learning from Human Feedback (RLHF), LLM
Discipline: Artificial Intelligence
-
Authors: S Casper, X Davies, C Shi, TK Gilbert
Year: 2023
Published in: arXiv preprint arXiv:2307.15217
Institution: Columbia University, Cornell Tech, Apollo Research, ETH Zurich, UC Berkeley, University of Sussex, Independent
Research Area: Reinforcement Learning from Human Feedback (RLHF), Alignment, LLM Limitations
Discipline: Artificial Intelligence
DOI: https://doi.org/10.48550/arXiv.2307.15217
Citations: 848
-
Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang
Year: 2023
Published in: arXiv preprint arXiv:2310.12773
Institution: Peking University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning
Discipline: Artificial Intelligence, Machine Learning
DOI: https://doi.org/10.48550/arXiv.2310.12773
Citations: 598
-
Authors: M Glickman, T Sharot
Year: 2025
Published in: Nature Human Behaviour (nature.com)
Institution: Max Planck University College London Centre, University College London, Affective Brain Lab
Research Area: Human-AI Feedback Loops, Perceptual and Emotional Judgement, Social Psychology
Discipline: Social Science, Psychology
Citations: 180
-
Authors: O Henkel, L Hills
Year: 2023
Published in: Proceedings of the Tenth ACM Conference on Learning @ Scale (L@S 2023, dl.acm.org)
Institution: University of Cambridge, University of Bath
Research Area: Crowdsourcing, Comparative Judgement, Educational Datasets, Human Feedback
Discipline: Computer Science
DOI: https://doi.org/10.1145/3573051.3596198
Citations: 2