
M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset

3 citations

2026

Abstract

Existing real-world datasets for multimodal fact-checking have multiple limitations: they contain few instances, cover only one or two languages, focus on only one task, or rely on external news article sets for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent a diverse range of cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake image detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks affects verdict prediction performance. We make our dataset and code publicly available.


Study specs

The dataset pairs 4,982 images with 6,980 claims. The images were verified by professional fact-checkers from 22 organizations and cover diverse cultural and geographic contexts. Each claim is available in one or two of ten languages, and the dataset spans six multimodal fact-checking tasks.

Authors: J Geng,J Tonglet,I Gurevych
Institution: KU Leuven,TU Darmstadt,Ubiquitous Knowledge Processing Lab,MBZUAI,ATHENE
Discipline: Machine Learning,Artificial Intelligence
Sample Size: N=6,980
Study Type: dataset
Year: 2026
Human Data Platform: Prolific
Source: View Source Google Scholar

Measured Outcomes

The study reports baseline results for all six multimodal fact-checking tasks in M4FC and analyzes how combining intermediate tasks affects verdict prediction performance.

Peer Review & Critical Discussion

3 threads

Potential Selection Bias in 2023 Cohort

Dr. Sarah J.

Verified PhD Candidate

12 replies

The participant pool shows a concerning overrepresentation of users from high-income demographics. Looking at Table 3, we can see that 78% of respondents had annual incomes above $75k, which significantly limits the generalizability of these findings to broader populations.

2 hours ago

Non-naive Participants Issue

M. Chen (OpenAI)

Data Scientist

8 replies

I've noticed a methodological concern regarding participant naivety. Given that Prolific users often complete multiple studies, there's a real risk that participants had prior exposure to similar experimental paradigms, which could confound the results.

5 hours ago

RLHF Applicability to This Study Design

Prof. R. Williams

Verified Researcher

15 replies

The implications for RLHF training pipelines are understated. If we accept the authors' conclusions about preference stability, this has direct consequences for how we should structure reward model training. The temporal decay effect described in Section 4.2 is particularly relevant.

1 day ago
