Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint (arxiv.org), 2025
Institution: Stanford University, UMass Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning from Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper investigates whether humans' expressed preferences can be influenced to align more closely with the preference models assumed by RLHF algorithms, through interventions such as showing humans model-derived quantities, training them on a preference model, and modifying how preference questions are framed.
Methods: Three human studies tested the interventions: revealing model-derived quantities to participants, training participants on a preference model, and reframing the preference-elicitation questions.
Key Findings: Assessed whether the interventions shifted humans' expressed preferences toward closer agreement with the preference models assumed by RLHF algorithms.
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
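The "assumed preference model" studied in work like this is most commonly the Bradley-Terry model, under which the probability that a labeler prefers one trajectory segment over another is a logistic function of the difference in their returns. A minimal sketch of that standard model (the return values below are illustrative, not drawn from the paper):

```python
import math

def bradley_terry_pref_prob(return_a: float, return_b: float) -> float:
    """Probability that a Bradley-Terry labeler prefers segment A over
    segment B, given the (discounted) return of each segment."""
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))

# Equal returns imply indifference (probability 0.5) ...
print(bradley_terry_pref_prob(1.0, 1.0))
# ... and a higher-return segment is preferred with probability > 0.5.
print(bradley_terry_pref_prob(2.0, 0.0) > 0.5)
```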
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint (arxiv.org), 2025
Institution: University of California, Davis, Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how humans' beliefs about an agent's capabilities influence the preferences they provide in RLHF, proposing a model that minimizes the mismatch between those beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Characterized how humans' beliefs about agent capabilities affect the preferences they provide and, in turn, the performance of the resulting RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
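The belief-mismatch idea summarized above can be illustrated loosely: a labeler's choice between two options may depend on the continuation policy they believe the agent will follow, not an idealized one. The sketch below is a hypothetical illustration of that general point, not the paper's actual model; all names and numbers are invented for the example.

```python
import math

def pref_prob(score_a: float, score_b: float) -> float:
    """Logistic choice probability over two scored options."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Hypothetical returns of two trajectory segments, evaluated two ways:
# under an idealized (near-optimal) continuation policy, and under the
# weaker policy the labeler believes the agent will actually follow.
returns_idealized = {"segment_a": 3.0, "segment_b": 2.0}
returns_believed = {"segment_a": 1.0, "segment_b": 2.5}

# A labeler judging by idealized capabilities tends to prefer segment_a,
# while one judging by believed capabilities tends to prefer segment_b:
p_idealized = pref_prob(returns_idealized["segment_a"], returns_idealized["segment_b"])
p_believed = pref_prob(returns_believed["segment_a"], returns_believed["segment_b"])
print(p_idealized > 0.5, p_believed < 0.5)
```

The point of the toy example is that the same pair of segments can yield opposite preference labels depending on which policy the labeler assumes, which is the mismatch the paper aims to model and reduce.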