True or False: Does the Deep Learning Model Learn to Detect Rumors?

Shiwen Ni, Jiawen Li, Hung-Yu Kao
DOI: https://doi.org/10.1109/TAAI54685.2021.00030
2021-12-01
Abstract: It is difficult for humans to distinguish true rumors from false ones, but current deep learning models can surpass humans and achieve excellent accuracy on many rumor datasets. In this paper, we investigate whether deep learning models that seem to perform well actually learn to detect rumors. We evaluate models on their generalization ability to out-of-domain examples by fine-tuning BERT-based models on five real-world datasets and evaluating against all test sets. The experimental results indicate that the generalization ability of the models on other unseen datasets is unsatisfactory; even common-sense rumors cannot be detected. Moreover, we found through experiments that models take shortcuts and learn absurd knowledge when the rumor datasets have serious data pitfalls. This means that simple modifications to the rumor text based on specific rules will lead to inconsistent model predictions. To evaluate rumor detection models more realistically, we propose a new evaluation method called paired test (PairT), which requires models to correctly predict a pair of test samples at the same time. Furthermore, we make recommendations on how to better create rumor datasets and evaluate rumor detection models at the end of this paper.
Computation and Language
What problem does this paper attempt to address?
The question this paper addresses is whether the deep-learning models that currently perform well have really learned to detect rumors. The authors explore it through four sub-questions:

1. **Can performance on individual rumor datasets be generalized to new datasets?** The authors fine-tune BERT-based models on five real-world datasets and evaluate them on all test sets. The models generalize poorly to unseen datasets and cannot even detect common-sense rumors.
2. **Can the model detect common-sense rumors?** The authors create a dataset of common-sense rumors and find that the model's performance on it is close to random guessing, indicating that the model has not really learned to detect even these simple rumors.
3. **Are the model's predictions trustworthy and consistent?** Case analysis shows that the predictions are inconsistent. For example, the model may judge "The neighbor's pet dog gave birth to a cat" to be true while also judging "Dogs can only give birth to dogs, and cats can only give birth to cats" to be true, and the two statements contradict each other.
4. **What has the model learned from the rumor datasets?** Using a word-level attention mechanism to inspect which words the model focuses on, the authors find that the model may rely on dataset-specific cues (such as "Obama", "Paul", and "Sydney") rather than understanding the text. This dependence leads to a significant drop in performance on adversarial datasets.

Overall, the paper aims to reveal the limitations of current deep-learning models on the rumor-detection task and offers suggestions for improving datasets and evaluation methods so that models become more reliable and generalize better.
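
The paired test (PairT) described in the abstract counts a prediction as correct only when both samples in a pair are classified correctly. The following is a minimal sketch of that scoring idea, not the authors' implementation. The function names (`paired_accuracy`, `single_accuracy`, `always_nonrumor`), the label encoding (1 = rumor, 0 = non-rumor), and the way the pair is built from the dog/cat example are all assumptions made for illustration; the source does not specify how pairs are constructed.

```python
from typing import Callable, Sequence, Tuple

# One labelled example: (text, gold_label).
# Assumed encoding for this sketch: 1 = rumor, 0 = non-rumor.
Example = Tuple[str, int]
Pair = Tuple[Example, Example]


def paired_accuracy(pairs: Sequence[Pair], predict: Callable[[str], int]) -> float:
    """Fraction of pairs in which BOTH members are classified correctly."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    both_correct = 0
    for (text_a, gold_a), (text_b, gold_b) in pairs:
        if predict(text_a) == gold_a and predict(text_b) == gold_b:
            both_correct += 1
    return both_correct / len(pairs)


def single_accuracy(pairs: Sequence[Pair], predict: Callable[[str], int]) -> float:
    """Ordinary per-sample accuracy over the same examples, for comparison."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    examples = [example for pair in pairs for example in pair]
    return sum(predict(text) == gold for text, gold in examples) / len(examples)


def always_nonrumor(text: str) -> int:
    """Hypothetical stand-in for a shortcut-taking model: it labels everything non-rumor."""
    return 0


# Toy pair built from the example sentences in the summary (labels are assumed).
toy_pairs = [
    (
        ("The neighbor's pet dog gave birth to a cat.", 1),
        ("Dogs can only give birth to dogs, and cats can only give birth to cats.", 0),
    ),
]

if __name__ == "__main__":
    print("single accuracy:", single_accuracy(toy_pairs, always_nonrumor))
    print("paired accuracy:", paired_accuracy(toy_pairs, always_nonrumor))
```

In this toy case the stand-in model gets one of the two sentences right, so per-sample accuracy looks moderate, while the paired score is zero because the pair as a whole is not predicted correctly. That gap is the kind of inconsistency a pair-level score is meant to expose.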