The inadequacy of reinforcement learning from human feedback - radicalizing large language models via semantic vulnerabilities
Authors: TR McIntosh, T Susnjak, T Liu, P Watters
Published: 2024
Publication: ... on Cognitive and ..., 2024 (ieeexplore.ieee.org)
Thesis: RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.
Methods: Psychological semantic conditioning techniques were applied to assess how susceptible LLMs are to ideological manipulation.
Key Findings: Under semantic conditioning, LLMs tended to adopt rather than resist extreme ideological viewpoints, indicating that RLHF safeguards can be circumvented.
Limitations: The paper does not explore long-term mitigation strategies, nor how specific LLM architectures affect vulnerability.
Institution: Cyberoo, Massey University, Cyberstronomy, RMIT University
Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations
Discipline: Computer Science, Artificial Intelligence, Machine Learning
Citations: 219