Large Language Models Hack Rewards, and Society

2026

Abstract

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.
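
The analogy between a regulation and a reward function can be made concrete with a toy sketch. The example below is purely illustrative and is not taken from the paper or from any SocioHack environment: the `Rule` class, the per-site cap, and the "intent" check are all hypothetical names and assumptions chosen to show how a strategy can satisfy the letter of a rule, earn a higher measured reward, and still defeat the rule's purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical regulation: each site must stay under a fixed output cap."""
    per_site_cap: float


def letter_compliant(rule, site_outputs):
    """What the rule actually measures: every individual site is under the cap."""
    return all(x < rule.per_site_cap for x in site_outputs)


def intent_satisfied(rule, site_outputs):
    """Assumed unwritten intent: the operator's total output stays under the cap."""
    return sum(site_outputs) < rule.per_site_cap


def proxy_reward(rule, site_outputs):
    """Reward seen by an optimiser: total output, but only if the letter holds."""
    return sum(site_outputs) if letter_compliant(rule, site_outputs) else 0.0


rule = Rule(per_site_cap=100.0)
honest = [90.0]              # one site, under the cap
split = [90.0, 90.0, 90.0]   # same operator, output split across three sites

for name, plan in [("honest", honest), ("split", split)]:
    print(name, proxy_reward(rule, plan),
          letter_compliant(rule, plan), intent_satisfied(rule, plan))
```

In this sketch the split plan earns three times the measured reward while remaining technically compliant, which is the kind of gap between a specified threshold and partially specified intent that the abstract describes.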

Relevant for

LLM Research

Has Dataset

Experiment

Study specs

The study introduced the SocioHack sandbox, consisting of 72 societal environments, to investigate reward hacking and loophole discovery by LLMs.

Authors: W Liu, X Mou, H Yan, Z Wei, Y He
Institution: King’s College London, Fudan University, Shanghai Innovation Institute, The Alan Turing Institute
Discipline: Machine Learning, Artificial Intelligence
Sample Size: N=72 (societal environments)
Study Type: Experimental Study
Year: 2026
Human Data Platform: Prolific

Measured Outcomes

The study measured the emergence of reward hacking in societal environments and the ability of models to find and exploit loopholes in social rules.

Peer Review & Critical Discussion

3 threads

Potential Selection Bias in 2023 Cohort

Dr. Sarah J.

Verified PhD Candidate

12 replies

The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.

2 hours ago

Non-naive Participants Issue

M. Chen (OpenAI)

Data Scientist

8 replies

I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.

5 hours ago

RLHF Applicability to This Study Design

Prof. R. Williams

Verified Researcher

15 replies

The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.

1 day ago
