The inadequacy of reinforcement learning from human feedback - radicalizing large language models via semantic vulnerabilities

Authors: TR McIntosh, T Susnjak, T Liu, P Watters

Published: 2024

Publication: ... on Cognitive and ..., 2024 - ieeexplore.ieee.org

RLHF mechanisms are insufficient to prevent semantic manipulation of LLMs, allowing them to express extreme ideological viewpoints when subjected to targeted conditioning techniques.

Methods: Psychological semantic conditioning techniques were applied to assess the susceptibility of LLMs to ideological manipulation.

Key Findings: LLMs trained with RLHF could still be led to express extreme ideological viewpoints under semantic conditioning, indicating that RLHF alone does not reliably prevent this kind of manipulation.

Limitations: The paper does not explore long-term mitigation strategies or how specific LLM architectures affect vulnerability.

Institutions: Cyberoo, Massey University, Cyberstronomy, RMIT University

Research Area: Semantic Vulnerabilities in LLMs, Ideological Manipulation, Reinforcement Learning from Human Feedback (RLHF) Limitations

Discipline: Computer Science, Artificial Intelligence, Machine Learning

Citations: 219