Authors: S Hatgis-Kessell, WB Knox, S Booth, S Niekum
Year: 2025
Published in: arXiv preprint (arxiv.org), 2025
Institution: Stanford University, UMass Amherst, Carnegie Mellon University
Research Area: Reinforcement Learning from Human Feedback (RLHF)
Discipline: Artificial Intelligence, Human-Computer Interaction (HCI)
The paper investigates whether humans' expressed preferences can be influenced to align more closely with the preference models assumed by RLHF algorithms, through interventions such as showing humans model-derived quantities, training them on a preference model, and modifying how preference questions are framed.
Methods: Three human studies tested the interventions: revealing model-derived quantities to participants, training participants on a preference model, and reframing the preference-elicitation questions.
Key Findings: Assessed whether the interventions shifted humans' expressed preferences toward closer agreement with the preference models assumed by RLHF algorithms.
DOI: https://doi.org/10.48550/arXiv.2501.06416
Citations: 1
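The "assumed preference model" studied in work like this is most commonly the Bradley-Terry model, under which the probability that a labeler prefers one trajectory segment over another is a logistic function of the difference in their returns. A minimal sketch of that standard model (the return values below are illustrative, not drawn from the paper):

```python
import math

def bradley_terry_pref_prob(return_a: float, return_b: float) -> float:
    """Probability that a Bradley-Terry labeler prefers segment A over
    segment B, given the (discounted) return of each segment."""
    return 1.0 / (1.0 + math.exp(-(return_a - return_b)))

# Equal returns imply indifference (probability 0.5) ...
print(bradley_terry_pref_prob(1.0, 1.0))
# ... and a higher-return segment is preferred with probability > 0.5.
print(bradley_terry_pref_prob(2.0, 0.0) > 0.5)
```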
Authors: S Dandekar, S Deshmukh, F Chiu, WB Knox
Year: 2025
Published in: arXiv preprint (arxiv.org), 2025
Institution: University of California, Davis, Northwestern University
Research Area: Reinforcement Learning from Human Feedback (RLHF), Human-AI Interaction, AI Theory
Discipline: Artificial Intelligence, Social Science
The paper investigates how humans' beliefs about an agent's capabilities influence the preferences they provide in RLHF, proposing a model that minimizes the mismatch between those beliefs and idealized agent capabilities, ultimately improving policy performance.
Methods: Human studies and synthetic experiments to model and test the impact of belief mismatches on human preferences and RLHF effectiveness.
Key Findings: Characterized how humans' beliefs about agent capabilities affect the preferences they provide and, in turn, the performance of the resulting RLHF policies.
DOI: https://doi.org/10.48550/arXiv.2506.01692
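The belief-mismatch idea summarized above can be illustrated loosely: a labeler's choice between two options may depend on the continuation policy they believe the agent will follow, not an idealized one. The sketch below is a hypothetical illustration of that general point, not the paper's actual model; all names and numbers are invented for the example.

```python
import math

def pref_prob(score_a: float, score_b: float) -> float:
    """Logistic choice probability over two scored options."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Hypothetical returns of two trajectory segments, evaluated two ways:
# under an idealized (near-optimal) continuation policy, and under the
# weaker policy the labeler believes the agent will actually follow.
returns_idealized = {"segment_a": 3.0, "segment_b": 2.0}
returns_believed = {"segment_a": 1.0, "segment_b": 2.5}

# A labeler judging by idealized capabilities tends to prefer segment_a,
# while one judging by believed capabilities tends to prefer segment_b:
p_idealized = pref_prob(returns_idealized["segment_a"], returns_idealized["segment_b"])
p_believed = pref_prob(returns_believed["segment_a"], returns_believed["segment_b"])
print(p_idealized > 0.5, p_believed < 0.5)
```

The point of the toy example is that the same pair of segments can yield opposite preference labels depending on which policy the labeler assumes, which is the mismatch the paper aims to model and reduce.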