Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback
Authors: A Dahlgren Lindström, L Methnani, L Krause
Published: 2025
Publication: Ethics and Information ..., Springer, 2025
The paper critiques AI alignment efforts based on RLHF and Reinforcement Learning from AI Feedback (RLAIF), highlighting theoretical and practical limits to meeting the goals of helpfulness, harmlessness, and honesty, and advocates a broader sociotechnical approach to AI safety and ethics.
Methods: Sociotechnical critique of RLHF techniques with an analysis of theoretical frameworks and practical implementations.
Key Findings: RLHF techniques fall short of fully aligning AI systems with human values; their efficacy in achieving the HHH principle (helpfulness, harmlessness, honesty) is limited both in theory and in practice.
Limitations: RLHF methods fail to capture the complexities of human ethics; trade-offs such as user-friendliness versus deception and flexibility versus interpretability, along with their implications for system safety, are often overlooked.
Institution: Umeå University, Vrije Universiteit Amsterdam
Research Area: AI Alignment, AI Safety, Reinforcement Learning from Human Feedback (RLHF), Sociotechnical Systems
Discipline: Artificial Intelligence, Ethics
Citations: 14
DOI: https://doi.org/10.1007/s10676-025-09837-2