Safe RLHF: Safe Reinforcement Learning from Human Feedback

Authors: J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang

Published: 2023

Publication: arXiv preprint arXiv ..., 2023 - arxiv.org

Research paper: Safe RLHF: Safe Reinforcement Learning from Human Feedback

Institution: Peking University

Research Area: Reinforcement Learning from Human Feedback (RLHF), Safe AI, Reinforcement Learning

Discipline: Artificial Intelligence, Machine Learning

Citations: 598

DOI: https://doi.org/10.48550/arXiv.2310.12773