A Descriptive and Normative Theory of Human Beliefs in RLHF
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Published: 2025
Publication: arXiv preprint, 2025 (arxiv.org)
Summary: The paper investigates how human beliefs about an agent's capabilities shape the preferences humans provide in RLHF. It proposes a model that minimizes the mismatch between those beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Human beliefs about agent capabilities measurably affect both the preferences humans provide and the performance of the resulting RLHF policies.
Limitations: Does not examine belief mismatches across diverse agent architectures, and does not explore complex intervention strategies in detail.
Institution: University of California, Davis; Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
DOI: https://doi.org/10.48550/arXiv.2506.01692