Bag of Lies: Robustness in Continuous Pre-training BERT

Ine Gevers, Walter Daelemans

2024-06-14

Abstract: This study aims to acquire more insights into the continuous pre-training phase of BERT regarding entity knowledge, using the COVID-19 pandemic as a case study. Since the pandemic emerged after the last update of BERT's pre-training data, the model has little to no entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID. To test the robustness of continuous pre-training, we experiment with several adversarial methods to manipulate the input data, such as training on misinformation and shuffling the word order until the input becomes nonsensical. Surprisingly, our findings reveal that these methods do not degrade, and sometimes even improve, the model's downstream performance. This suggests that continuous pre-training of BERT is robust against misinformation. Furthermore, we are releasing a new dataset, consisting of original texts from academic publications in the LitCovid repository and their AI-generated false counterparts.

Computation and Language

What problem does this paper attempt to address?

The paper examines how continuous pre-training (CPT) affects BERT's acquisition of entity knowledge and its robustness, focusing on entities that are absent from BERT's original pre-training data (e.g., COVID-19). It addresses three questions:

1. **Importance of Entity Knowledge**: Can BERT use newly introduced entity knowledge in fact verification? The study hypothesizes that entity knowledge introduced through CPT improves BERT's fact-verification performance.
2. **Role of Information Authenticity**: How does the authenticity of the input data (correct versus incorrect information) affect BERT's fact-verification accuracy? The study hypothesizes that incorrect information hurts model performance.
3. **Robustness of CPT**: How does manipulating the input data during CPT (e.g., supplying incorrect information or shuffling word order) affect model performance? The study hypothesizes that CPT is not robust to incorrect information, but that meaningless data (e.g., shuffled word order) does not degrade performance.

Through these questions, the authors aim to assess the effectiveness and stability of CPT for domain-specific entity knowledge and to provide insights for future research.
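The word-order manipulation mentioned above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `shuffle_word_order`, the whitespace tokenization, the fixed seed, and the example sentence are all assumptions made for illustration, and the paper may shuffle at a different granularity.

```python
import random


def shuffle_word_order(text: str, seed: int = 0) -> str:
    """Return `text` with its whitespace-separated words in random order.

    Hypothetical helper: splitting on whitespace and shuffling the whole
    passage at once are assumptions, not details taken from the paper.
    """
    words = text.split()
    rng = random.Random(seed)  # local generator keeps the result reproducible
    rng.shuffle(words)         # in-place shuffle of the word list
    return " ".join(words)


# Illustrative sentence, not taken from the released dataset.
original = "COVID-19 is caused by the SARS-CoV-2 virus"
shuffled = shuffle_word_order(original, seed=13)
print(shuffled)
```

The shuffled text keeps exactly the same words as the original, so the entity mentions survive while the syntax is destroyed. That is the property that makes this manipulation a useful probe of what CPT actually picks up from the input.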

On Robustness and Bias Analysis of BERT-Based Relation Extraction

RoBERTa: A Robustly Optimized BERT Pretraining Approach

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection

Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models

Auditing and Robustifying COVID-19 Misinformation Datasets via Anticontent Sampling

Unveiling the Potential of BERTopic for Multilingual Fake News Analysis -- Use Case: Covid-19

Transformer-Based Language Model Fine-Tuning Methods for COVID-19 Fake News Detection

Survey of BERT-Base Models for Scientific Text Classification: COVID-19 Case Study

Breaking BERT: Understanding its Vulnerabilities for Named Entity Recognition through Adversarial Attack

Comment on "Observation of a noise-induced phase transition with an analog simulator"

COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter

Exploring COVID-related relationship extraction: Contrasting data sources and analyzing misinformation

Robustness and Sensitivity of BERT Models Predicting Alzheimer's Disease from Text

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Verifying the Robustness of Automatic Credibility Assessment

Supervised Contrastive Learning for Multimodal Unreliable News Detection in COVID-19 Pandemic

ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning

On Effectively Learning of Knowledge in Continual Pre-training