Replication: How well do my results generalize now? The external validity of online privacy and security surveys
Abstract
Privacy and security researchers often rely on data collected through online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) and Prolific. Prior work---which used data collected in the United States between 2013 and 2017---found that MTurk responses regarding security and privacy were generally representative for people under 50 or with some college education. However, the landscape of online crowdsourcing has changed significantly over the last five years, with the rise of Prolific as a major platform and the increasing presence of bots. This work attempts to replicate the prior results about the external validity of online privacy and security surveys. We conduct an online survey on MTurk (n=800), a gender-balanced survey on Prolific (n=800), and a representative survey on Prolific (n=800) and compare the responses to a probabilistic survey conducted by the Pew Research Center (n=4272). We find that MTurk response quality has degraded over the last five years, and our results do not replicate the earlier finding about the generalizability of MTurk responses. By contrast, we find that data collected through Prolific is generally representative for questions about user perceptions and experiences, but not for questions about security and privacy knowledge. We also evaluate the impact of Prolific settings, attention check questions, and statistical methods on the external validity of online surveys, and we develop recommendations about best practices for conducting online privacy and security surveys.
Study specs
- Discipline
- Computer Science,Behavioral Science
- Year
- 2022
- Human Data Platform
- Prolific
- Source
- View Source Google Scholar
Peer Review & Critical Discussion
Potential Selection Bias in 2023 Cohort
The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.
Non-naive Participants Issue
I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.
RLHF Applicability to This Study Design
The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.
Verify your expertise to join discussion
Create an account and verify your credentials to participate in peer discussions.