Influencing Humans to Conform to Preference Models for RLHF

Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum

Published: 2025

Publication: arXiv preprint arXiv ..., 2025

The paper investigates whether the way humans express preferences can be influenced to conform more closely to the preference model assumed by an RLHF algorithm. It considers three interventions: showing participants model-derived quantities, training participants on a preference model, and modifying the preference elicitation question.

Methods: Three human studies tested the interventions: revealing model-derived quantities to participants, training participants on a preference model, and altering how the preference question was framed.

Key Findings: The studies evaluated how each intervention affected whether humans' expressed preferences conformed more closely to the preference model assumed by an RLHF algorithm.

Institution: Stanford University, UMass Amherst, Carnegie Mellon University

Research Area: Reinforcement Learning from Human Feedback (RLHF)

Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)

Citations: 1

DOI: https://doi.org/10.48550/arXiv.2501.06416