Logically at Factify 2022: Multimodal Fact Verification

Jie Gao, Hella-Franziska Hoffmann, Stylianos Oikonomou, David Kiskovski, Anil Bandhakavi
DOI: https://doi.org/10.48550/arXiv.2112.09253
2022-03-26
Abstract: This paper describes our participant system for the multimodal fact verification (Factify) challenge at AAAI 2022. Despite recent advances in text-based verification techniques and large pre-trained multimodal models across vision and language, very limited work has been done on applying multimodal techniques to automate the fact-checking process, particularly given the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as a multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored: an ensemble model (combining two unimodal models) and a multimodal attention network (modeling the interaction between the image and text pairs from the claim and the evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models. Our best model is ranked first on the leaderboard and obtains a weighted average F-measure of 0.77 on both the validation and test sets. We also carry out an exploratory analysis of the Factify dataset, which uncovers salient patterns and issues (e.g., word overlap, visual entailment correlation, source bias) that motivate our hypothesis. Finally, we highlight challenges of the task and the multimodal dataset for future research.
Computer Vision and Pattern Recognition, Computation and Language, Multimedia
What problem does this paper attempt to address?
This paper addresses multimodal fact verification, motivated by the growing volume of false information about images and videos on social media. Specifically, it asks how multimodal techniques can be used to automate the fact-checking process, an area that existing research has explored very little. The authors treat the challenge as a multimodal entailment task and frame it as a multi-class classification problem. They propose two baseline methods: an ensemble model that combines two unimodal models, and a multimodal attention network that models the interaction between the image-text pairs of the claim and the evidence document. The core of the paper lies in developing models that capture the semantic consistency and integrity between images and text, in order to address the following specific challenges:

1. **Fine-grained image differences**: Simple image similarity cannot distinguish subtle image differences and performs poorly on adversarial images.
2. **Cross-modal semantic integrity**: It is necessary not only to learn the content features of images and text separately, but also to capture cross-modal semantic consistency.
3. **Problems in the dataset**: Issues such as word overlap, visual entailment correlation, and source bias, which are revealed by the exploratory data analysis.

With these methods, the paper aims to improve the accuracy and efficiency of multimodal fact verification, so as to better counter the spread of false information on social media.
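To make the ensemble idea and the reported metric more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a simple late-fusion scheme in which a text model and an image model each output per-class probabilities that are then averaged with a weight, and it computes a support-weighted average F1 from scratch. The function names (`fuse_probabilities`, `predict_class`, `weighted_f1`) and the `text_weight` parameter are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence


def fuse_probabilities(
    text_probs: Sequence[float],
    image_probs: Sequence[float],
    text_weight: float = 0.5,  # hypothetical fusion weight, not from the paper
) -> List[float]:
    """Late fusion: weighted average of two per-class probability vectors."""
    if len(text_probs) != len(image_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return [
        text_weight * t + (1.0 - text_weight) * i
        for t, i in zip(text_probs, image_probs)
    ]


def predict_class(probs: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(probs)), key=lambda c: probs[c])


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = len(y_true)
    score = 0.0
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # define F1 as 0 for unseen classes
        score += f1 * (tp + fn) / total  # (tp + fn) is the class support
    return score


if __name__ == "__main__":
    # Toy two-class example: the text model favours class 0, the image model class 1.
    fused = fuse_probabilities([0.6, 0.4], [0.2, 0.8])
    print(predict_class(fused))
    print(weighted_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Averaging probabilities is only one way to combine unimodal models; a learned meta-classifier over the two models' outputs is a common alternative, and the source does not say which combination rule the authors used.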