Large Language Models Pass the Turing Test
Abstract
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
Study specs
Randomised, controlled, pre-registered Turing Test where 5-minute conversations were conducted between human participants and AI systems, followed by judgments on which partner was human.
- Institution
- University of California San Diego
- Discipline
- Artificial Intelligence
- Study Type
- Experimental Study
- Year
- 2025
- Human Data Platform
- Prolific
- Source
- View Source Google Scholar
Measured Outcomes
The ability of AI systems (ELIZA, GPT-4o, LLaMa-3.1-405B, GPT-4.5) to mimic human conversational behavior and be perceived as human.
Peer Review & Critical Discussion
Potential Selection Bias in 2023 Cohort
The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.
Non-naive Participants Issue
I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.
RLHF Applicability to This Study Design
The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.
Verify your expertise to join discussion
Create an account and verify your credentials to participate in peer discussions.