A Multi-Label Dataset of French Fake News: Human and Machine Insights

Benjamin Icard, François Maine, Morgane Casanova, Géraud Faye, Julien Chanson, Guillaume Gadek, Ghislain Atemezing, François Bancilhon, Paul Égré
2024-04-11
Abstract: We present a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following URL:
Machine Learning
What problem does this paper attempt to address?
This paper addresses the multidimensional complexity of fake news detection. Specifically:

1. **Multidimensional Nature of Fake News**: Fake news is not just plainly false information; it also includes satire, misinformation, and biased or partisan information. Existing fake news detection algorithms typically rely on two labels (such as "biased" and "legitimate"), which cannot capture this complexity.
2. **Accuracy and Richness of Datasets**: Reliable fake news detectors need datasets whose labels are accurate enough and cover multiple dimensions. Multi-label fake news datasets exist, but they usually omit stylistic information or offer only a limited number of labels.
3. **Comparison of Human Annotation and Machine Prediction**: Collecting more labels from more annotators makes it possible to identify the features that humans perceive as characteristic of fake news and to compare them with the predictions of automated classifiers, which clarifies how fake news is identified.

To address these issues, the authors built OBSINFOX, a French multi-label fake news dataset containing 100 articles from sources considered unreliable. The articles were annotated by 8 annotators using 11 labels, with the aim of identifying which labels best characterize each text and which cues best explain why a text is classified as fake news by both humans and machines.
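The comparison between human annotation and machine prediction described above can be illustrated with a minimal sketch. It assumes binary votes per label, a strict majority vote to aggregate the 8 annotators, and a simple match rate against a classifier's output. The paper's own aggregation and evaluation procedure is not given here, so these choices are assumptions. All data values and the names `annotations`, `predictions`, `majority_vote`, `agreement_rate`, and `match_rate` are hypothetical; only the label names Fake News and Subjective come from the abstract.

```python
# Toy data invented for illustration; these are NOT values from OBSINFOX.
# Each document maps a label name to the binary votes of 8 annotators.
annotations = {
    "doc_1": {"Fake News": [1, 1, 1, 0, 1, 1, 0, 1],
              "Subjective": [1, 1, 1, 1, 1, 0, 1, 1]},
    "doc_2": {"Fake News": [0, 0, 1, 0, 0, 0, 0, 0],
              "Subjective": [1, 0, 1, 1, 0, 1, 1, 0]},
    "doc_3": {"Fake News": [0, 0, 0, 0, 0, 0, 0, 0],
              "Subjective": [0, 0, 1, 0, 0, 0, 0, 0]},
}

# Hypothetical classifier outputs for the same documents and labels.
predictions = {
    "doc_1": {"Fake News": 1, "Subjective": 1},
    "doc_2": {"Fake News": 0, "Subjective": 1},
    "doc_3": {"Fake News": 1, "Subjective": 0},
}


def majority_vote(votes):
    """Return 1 if strictly more than half of the annotators chose the label.

    A tie (e.g. 4 of 8) resolves to 0 under this assumed rule.
    """
    return int(2 * sum(votes) > len(votes))


def agreement_rate(votes):
    """Fraction of annotators who side with the larger group."""
    ones = sum(votes)
    return max(ones, len(votes) - ones) / len(votes)


def match_rate(annotations, predictions, label):
    """Share of documents where the classifier matches the human majority."""
    matches = sum(
        majority_vote(labels[label]) == predictions[doc][label]
        for doc, labels in annotations.items()
    )
    return matches / len(annotations)


if __name__ == "__main__":
    for label in ("Fake News", "Subjective"):
        print(label, round(match_rate(annotations, predictions, label), 2))
```

With 11 labels rather than two, the same per-label loop would show which labels annotators agree on most and which ones a classifier recovers best. A chance-corrected agreement coefficient would be a more robust choice than the raw agreement rate used here.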