M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset
Abstract
Existing real-world datasets for multimodal fact-checking have multiple limitations: they contain few instances, cover only one or two languages, focus on only one task, or rely on external news article sets for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent a diverse range of cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake image detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks affects verdict prediction performance. We make our dataset and code publicly available.
Study specs
The dataset pairs 4,982 images with 6,980 claims. The images were verified by professional fact-checkers from 22 organizations and cover diverse cultural and geographic contexts. Each claim is available in one or two of ten languages, and the dataset spans six multimodal fact-checking tasks.
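To make the structure described above concrete, here is a minimal Python sketch of how one image-claim pair could be represented. Only the six task names, the "one or two languages per claim" constraint, and the fact that there are more claims than images come from the abstract. The class name `ClaimRecord`, its field names, and the helper `claims_per_image` are hypothetical and do not reflect the released M4FC schema.

```python
from dataclasses import dataclass, field

# The six task names are taken from the abstract; the identifiers are our own.
TASKS = (
    "visual_claim_extraction",
    "claimant_intent_prediction",
    "fake_image_detection",
    "image_contextualization",
    "location_verification",
    "verdict_prediction",
)


@dataclass
class ClaimRecord:
    """Hypothetical layout for one image-claim pair (not the released schema)."""

    image_id: str
    claim_by_language: dict[str, str]  # language code -> claim text
    labels: dict[str, str] = field(default_factory=dict)  # task name -> annotation

    def __post_init__(self):
        # The abstract states that each claim is available in one or two languages.
        if not 1 <= len(self.claim_by_language) <= 2:
            raise ValueError("expected the claim in one or two languages")
        unknown = set(self.labels) - set(TASKS)
        if unknown:
            raise ValueError(f"unknown task(s): {sorted(unknown)}")


def claims_per_image(records):
    """Count how many claims are attached to each image."""
    counts = {}
    for record in records:
        counts[record.image_id] = counts.get(record.image_id, 0) + 1
    return counts
```

Keying claims by image identifier reflects that the dataset has more claims (6,980) than images (4,982), so an image may carry more than one claim.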
- Authors: J Geng, J Tonglet, I Gurevych
- Discipline: Machine Learning, Artificial Intelligence
- Sample Size: N=6,980
- Study Type: dataset
- Year: 2026
- Human Data Platform: Prolific
Measured Outcomes
The study reports baseline results on all six multimodal fact-checking tasks and analyzes how combining intermediate tasks affects verdict prediction performance.
Peer Review & Critical Discussion
Potential Selection Bias in 2023 Cohort
The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.
Non-naive Participants Issue
I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.
RLHF Applicability to This Study Design
The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.