Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint arXiv:2501.06416, 2025
Institution: Stanford University, UMass Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning with Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper investigates whether human preferences can be influenced to conform more closely to the preference models assumed by RLHF algorithms.
Methods: Three human studies tested interventions that included revealing model-derived quantities to participants, training participants on a preference model, and altering how preference questions were framed.
Key Findings: The studies evaluated how effectively each intervention shifted humans' expressed preferences toward the preference model assumed by RLHF algorithms.
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
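Context for the "assumed preference model" above: RLHF algorithms commonly assume a Bradley-Terry model in which the probability that a human prefers one trajectory segment over another is a logistic function of the difference in their partial returns (summed rewards). The sketch below illustrates that standard assumption only; it is not code from the paper, and all names (bradley_terry_preference_prob, beta) are illustrative.

import math

def bradley_terry_preference_prob(rewards_a, rewards_b, beta=1.0):
    # Probability that a human prefers segment A over segment B under
    # the partial-return Bradley-Terry model commonly assumed in RLHF.
    # beta is an assumed rationality (inverse-temperature) coefficient.
    return_a = sum(rewards_a)  # partial return of segment A
    return_b = sum(rewards_b)  # partial return of segment B
    return 1.0 / (1.0 + math.exp(-beta * (return_a - return_b)))

# Example: segment A accumulates more reward, so the assumed model
# predicts it is preferred with probability above 0.5.
print(bradley_terry_preference_prob([1.0, 0.5], [0.2, 0.1]))

Interventions like those studied in the paper aim to make humans' actual choice behavior better match a model of this form, rather than changing the model to match the humans.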
Authors: GKM Liu
Year: 2023
Published in: Massachusetts Institute of Technology, 2023
Institution: Massachusetts Institute of Technology
Research Area: Reinforcement Learning with Human Feedback (RLHF), Human-AI Interaction
Discipline: Artificial Intelligence
The paper explores Reinforcement Learning with Human Feedback (RLHF) as a transformative tool to align AI with human values, mitigate bias, and democratize technology, while emphasizing its societal implications and ethical considerations.
Methods: The paper employs a systematic study of existing and potential societal effects of RLHF, guided by key questions addressing ethical, social, and practical impacts.
Key Findings: The study identifies impacts of RLHF on information integrity, societal values, social equity, access to AI, cultural relations, industrial transformation, and labor dynamics.
Citations: 17