Continual Pre-Training Mitigates Forgetting in Language and Vision

Andrea Cossu, Tinne Tuytelaars, Antonio Carta, Lucia Passaro, Vincenzo Lomonaco, Davide Bacciu
DOI: https://doi.org/10.48550/arXiv.2205.09357
2022-05-19
Abstract: Pre-trained models are nowadays a fundamental component of machine learning research. In continual learning, they are commonly used to initialize the model before training on the stream of non-stationary data. However, pre-training is rarely applied during continual learning. We formalize and investigate the characteristics of the continual pre-training scenario in both language and vision environments, where a model is continually pre-trained on a stream of incoming data and only later fine-tuned to different downstream tasks. We show that continually pre-trained models are robust against catastrophic forgetting and we provide strong empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols. Code is provided at https://github.com/AndreaCossu/continual-pretraining-nlp-vision.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how continual pre-training can mitigate catastrophic forgetting in continual learning (CL). In both natural language processing (NLP) and computer vision (CV) settings, a model is continually pre-trained on a stream of incoming data and only later fine-tuned on different downstream tasks. The paper studies whether a model trained this way retains previously acquired knowledge while still adapting to new data. Its main contributions are as follows:
1. **Formalize the Continual Pre-Training Scenario**: The paper formally defines the continual pre-training scenario and describes an evaluation method for measuring the effect of continual pre-training on catastrophic forgetting.
2. **Construct Evaluation Environments for NLP and CV**: To study the effect of continual pre-training, the paper builds two evaluation environments, one for NLP and one for CV, and runs extensive experiments across different datasets, model architectures, and pre-training protocols.
3. **Show the Effectiveness of Unsupervised / Self-supervised Pre-training**: The experiments show that unsupervised or self-supervised pre-training protocols mitigate forgetting more effectively than supervised protocols. Continual pre-training therefore helps the model adapt to new data while better retaining previously learned knowledge.
4. **Analyze the Changes in the Model Feature Space**: Using linear evaluation of the model feature space and centered kernel alignment (CKA), the paper finds that supervised pre-training leads to greater feature drift, while self-supervised pre-training keeps features more consistent.
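The CKA comparison mentioned in the fourth contribution can be illustrated with a small sketch. The summary does not say which kernel the paper uses, so the code below assumes the linear variant of CKA, and the function names (`center_columns`, `cross_frobenius_sq`, `linear_cka`) and the toy feature matrices are hypothetical, chosen only for illustration. It uses the standard library so it runs without dependencies; real feature matrices would normally be handled with an array library.

```python
import math


def center_columns(X):
    """Subtract the per-feature (column) mean from a row-major matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_frobenius_sq(A, B):
    """Squared Frobenius norm of A^T B, for A (n x p) and B (n x q)."""
    total = 0.0
    for col_a in zip(*A):
        for col_b in zip(*B):
            dot = sum(a * b for a, b in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA between two feature matrices computed on the same n examples.

    Returns a value in [0, 1]; 1 means the two representations match
    up to an orthogonal transform and isotropic scaling.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must contain the same number of examples")
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    if denominator == 0.0:
        raise ValueError("features have zero variance; CKA is undefined")
    return numerator / denominator


if __name__ == "__main__":
    # Toy "features before continual pre-training" (4 examples, 2 dimensions).
    before = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.5]]
    # The same features rotated by 90 degrees: no drift as far as CKA is concerned.
    rotated = [[-x2, x1] for x1, x2 in before]
    # Features that changed substantially.
    drifted = [[1.0, -1.0], [-1.0, 0.5], [1.0, 2.0], [-1.0, 0.0]]
    print("rotated:", linear_cka(before, rotated))
    print("drifted:", linear_cka(before, drifted))
```

Under these assumptions, a score near 1 between the features of the original model and those of the continually pre-trained model would indicate little feature drift, and a lower score would indicate more drift.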
In conclusion, the paper's experiments and analyses indicate that continual pre-training is an effective strategy: it helps a model retain old knowledge while learning from a continuous data stream, without significant additional cost. This matters, both in theory and in practice, for applications that must adapt to changing environments over long periods, such as online learning systems and robotics.