Influencing Humans to Conform to Preference Models for RLHF
Authors: S. Hatgis-Kessell, W. B. Knox, S. Booth, S. Niekum
Published: 2025
Publication: arXiv preprint arXiv:2501.06416, 2025
The paper investigates whether humans' expressed preferences can be influenced to conform more closely to the preference models assumed by RLHF algorithms, using three interventions: showing humans model-derived quantities, training them on a preference model, and modifying how the preference elicitation question is framed (a standard assumed preference model is sketched below, after Key Findings).
Methods: Three human-subject studies tested the interventions: revealing model-derived quantities, training participants on a preference model, and altering how preference questions were framed.
Key Findings: The studies evaluate the degree to which each intervention brings humans' expressed preferences into closer alignment with the preference model assumed by the RLHF algorithm.
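Background (illustrative context, not drawn from the paper itself): RLHF algorithms commonly assume a Bradley-Terry-style preference model, under which a human prefers trajectory segment σ1 over σ2 with probability determined by each segment's partial return R(σ) = Σ_t r(s_t, a_t):

    P(σ1 ≻ σ2) = exp(R(σ1)) / (exp(R(σ1)) + exp(R(σ2)))

The interventions above are meant to nudge humans' actual choice behavior toward whichever such model the RLHF algorithm assumes.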
Institutions: Stanford University, UMass Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning from Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
Citations: 1
DOI: https://doi.org/10.48550/arXiv.2501.06416