Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint arXiv ..., 2025 - arxiv.org
Institution: University of California, Davis; Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
Summary: The paper investigates how human beliefs about agent capabilities influence the preferences people provide in RLHF. It proposes a model aimed at minimizing the mismatch between those beliefs and idealized agent capabilities, with the goal of improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Human beliefs about agent capabilities affect the preferences they provide and the performance of the resulting RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692