"Better Be Computer or I'm Dumb": A Large-Scale Evaluation of Humans as Audio Deepfake Detectors
Abstract
Audio deepfakes represent a rising threat to trust in our daily communications. In response to this, the research community has developed a wide array of detection techniques aimed at preventing such attacks from deceiving users. Unfortunately, the creation of these defenses has generally overlooked the most important element of the system - the user themselves. As such, it is not clear whether current mechanisms augment, hinder, or simply contradict human classification of deepfakes. In this paper, we perform the first large-scale user study on deepfake detection. We recruit over 1,200 users and present them with samples from the three most widely-cited deepfake datasets. We then quantitatively compare performance and qualitatively conduct thematic analysis to motivate and understand the reasoning behind user decisions and differences from machine classifications. Our results show that users correctly classify human audio at significantly higher rates than machine learning models, and rely on linguistic features and intuition when performing classification. However, users are also regularly misled by pre-conceptions about the capabilities of generated audio (e.g., that accents and background sounds are indicative of humans). Finally, machine learning models suffer from significantly higher false positive rates, and experience false negatives that humans correctly classify when issues of quality or robotic characteristics are reported. By analyzing user behavior across multiple deepfake datasets, our study demonstrates the need to more tightly compare user and machine learning performance, and to target the latter towards areas where humans are less likely to successfully identify threats.
Study specs
A large-scale user study in which over 1,200 participants classified audio samples from three widely-cited deepfake datasets. Human performance was compared quantitatively against machine learning models, and thematic analysis was used to explore participants' reasoning and where their decisions differed from machine classifications.
- Authors: K Warren, T Tucker, A Crowder, D Olszewski
- Institution: University of Florida
- Discipline: Computer Science
- Sample Size: over 1,200 participants
- Study Type: Evaluation Study
- Year: 2024
- Human Data Platform: Prolific
Measured Outcomes
Human versus machine classification performance on audio deepfake detection, the reasoning users give for their decisions, and the error patterns (false positives and false negatives) of humans compared with those of models.
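The comparison of error patterns can be made concrete with a small sketch. It is not the authors' analysis code: the function name `error_rates`, the convention that `True` means "deepfake" (so a false positive is real audio flagged as fake), and the toy labels are all assumptions made purely for illustration, and the numbers are not results from the study.

```python
from dataclasses import dataclass


@dataclass
class ErrorRates:
    accuracy: float
    false_positive_rate: float  # real audio wrongly flagged as deepfake
    false_negative_rate: float  # deepfake audio wrongly accepted as real


def error_rates(truth, predicted):
    """Compute accuracy, FPR and FNR for one classifier (human or model).

    Both arguments are equal-length sequences of bools where True means
    "deepfake". This labeling convention is an assumption of the sketch.
    """
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("truth and predicted must be non-empty and equal length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    real, fake = tn + fp, tp + fn
    return ErrorRates(
        accuracy=(tp + tn) / len(pairs),
        false_positive_rate=fp / real if real else float("nan"),
        false_negative_rate=fn / fake if fake else float("nan"),
    )


# Toy labels invented for illustration only (NOT data from the study).
truth = [True, True, True, False, False, False]
human = [True, False, True, False, False, False]
model = [True, True, True, True, False, True]

human_rates = error_rates(truth, human)
model_rates = error_rates(truth, model)
print(human_rates)
print(model_rates)
```

Reporting false positive and false negative rates separately, rather than accuracy alone, is what lets the two kinds of error be compared between humans and models on the same samples.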
Peer Review & Critical Discussion
Potential Selection Bias in 2023 Cohort
The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.
Non-naive Participants Issue
I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.
RLHF Applicability to This Study Design
The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.