Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models

Junhao Zheng, Shengjie Qiu, Qianli Ma
2024-08-08
Abstract: Incremental Learning (IL) has been a long-standing problem in both vision and Natural Language Processing (NLP) communities. In recent years, as Pre-trained Language Models (PLMs) have achieved remarkable progress in various NLP downstream tasks, utilizing PLMs as backbones has become a common practice in recent research of IL in NLP. Most assume that catastrophic forgetting is the biggest obstacle to achieving superior IL performance and propose various techniques to overcome this issue. However, we find that this assumption is problematic. Specifically, we revisit more than 20 methods on four classification tasks (Text Classification, Intent Classification, Relation Extraction, and Named Entity Recognition) under the two most popular IL settings (Class-Incremental and Task-Incremental) and reveal that most of them severely underestimate the inherent anti-forgetting ability of PLMs. Based on the observation, we propose a frustratingly easy method called SEQ* for IL with PLMs. The results show that SEQ* has competitive or superior performance compared to state-of-the-art (SOTA) IL methods and requires considerably fewer trainable parameters and less training time. These findings urge us to revisit IL with PLMs and encourage future studies to have a fundamental understanding of catastrophic forgetting in PLMs. The data, code and scripts are publicly available at https://github.com/zzz47zzz/codebase-for-incremental-learning-with-llm.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses a common misconception in incremental learning (IL) with pre-trained language models (PLMs): that catastrophic forgetting is the main obstacle to good performance. By re-examining more than 20 methods on four classification tasks (text classification, intent classification, relation extraction, and named-entity recognition), the authors show that most methods seriously underestimate the inherent anti-forgetting ability of PLMs. Based on this observation, they propose a simple method, SEQ*, for IL with PLMs, and show that it is comparable or superior to existing state-of-the-art (SOTA) IL methods while requiring significantly fewer trainable parameters and less training time.

### Main Findings

1. **Anti-Forgetting Ability of PLMs**:
   - Even under plain sequential fine-tuning (SEQ), PLMs retain knowledge without significant forgetting.
   - Probing shows that most existing IL methods do not add incremental knowledge to the PLM.
2. **Effectiveness of the SEQ* Method**:
   - By combining simple strategies (such as freezing the PLM and freezing old classifiers), SEQ* performs comparably to or better than SOTA methods across multiple tasks and settings.
   - SEQ* requires significantly fewer trainable parameters and less training time.
3. **Anti-Forgetting Mechanism of PLMs**:
   - The anti-forgetting ability comes not only from the pre-training stage but also from the Transformer architecture.
   - Randomly initialized PLMs can also gradually absorb new knowledge under SEQ.
4. **Reasons for Forgetting**:
   - It is the classifier that forgets, not the PLM itself.
   - Classifier forgetting shows up mainly as the embedding vectors of old classes being pushed away from their initial, optimal positions.
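The probing perspective in the first finding can be illustrated with a small example. Probing, in general, means keeping the backbone fixed and fitting a fresh, simple classifier on its features to see how much class information those features still hold. The following is a minimal sketch of one such probe, a class-mean (prototype) classifier scored with cosine similarity. It assumes the features have already been extracted from a frozen backbone; the function names and the toy two-dimensional data are illustrative and do not come from the paper's codebase.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def class_prototypes(features, labels):
    """Average the features of each class into a single prototype vector."""
    sums, counts = {}, defaultdict(int)
    for vec, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(vec)
        sums[y] = [s + v for s, v in zip(sums[y], vec)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def probe_accuracy(prototypes, features, labels):
    """Classify each feature by its most cosine-similar prototype and score it."""
    correct = 0
    for vec, y in zip(features, labels):
        pred = max(prototypes, key=lambda c: cosine(prototypes[c], vec))
        correct += pred == y
    return correct / len(labels)


# Toy stand-ins for features produced by a frozen backbone (assumed data).
train_x = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
train_y = [0, 0, 1, 1]
test_x = [[2.0, 0.3], [0.2, 1.5]]
test_y = [0, 1]

prototypes = class_prototypes(train_x, train_y)
accuracy = probe_accuracy(prototypes, test_x, test_y)
```

If a probe of this kind stays accurate on old classes after the model has been trained on new tasks, the backbone features still separate those classes, which is the sense in which the summary says the PLM itself has not forgotten.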
### Methods

- **Experimental Setup**:
  - The authors conducted extensive experiments under two popular IL settings: class-incremental learning (CIL) and task-incremental learning (TIL).
  - Multiple model architectures (encoder-only and decoder-only) and model scales (from 19M to 1.21B parameters) were used.
- **Evaluation Metrics**:
  - Four metrics were used to evaluate probing performance: linear probing, cosine linear probing, prototype probing, and cosine prototype probing.
- **SEQ* Strategies**:
  - Freeze the PLM after warm-up.
  - Freeze the old classifiers when learning new tasks.
  - Use cosine linear classifiers only when no old data is available in a CIL scenario.
  - Pre-allocate future classifiers (optional).

### Conclusions

- This research urges the NLP community to re-examine and more deeply understand the forgetting problem in PLMs.
- The proposed SEQ* method provides a simple and effective way to address forgetting in IL and has broad application prospects.
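Two of the SEQ* strategies listed above, freezing old classifiers and using a cosine linear classifier, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the class `CosineClassifier`, its methods, and the update rule (which simply pulls a class vector toward a feature) are hypothetical stand-ins for a real gradient-trained head. The input features are assumed to come from an already frozen PLM, and the warm-up and pre-allocation steps are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


class CosineClassifier:
    """Toy cosine linear classifier over features from a frozen backbone."""

    def __init__(self):
        self.weights = []  # one weight vector per class
        self.frozen = []   # frozen[i] is True once class i belongs to an old task

    def add_task(self, init_vectors):
        """Freeze every existing class vector, then append heads for the new classes."""
        self.frozen = [True] * len(self.weights)
        for vec in init_vectors:
            self.weights.append(list(vec))
            self.frozen.append(False)

    def logits(self, feature):
        """Cosine logits: independent of the norm of each class vector."""
        return [cosine(w, feature) for w in self.weights]

    def predict(self, feature):
        scores = self.logits(feature)
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, feature, label, lr=0.5):
        """Hypothetical update: move the labeled class vector toward the feature.

        Frozen (old-task) class vectors are never modified.
        """
        if self.frozen[label]:
            return
        w = self.weights[label]
        self.weights[label] = [wi + lr * (fi - wi) for wi, fi in zip(w, feature)]


clf = CosineClassifier()
clf.add_task([[1.0, 0.0], [0.0, 1.0]])      # task 1 introduces classes 0 and 1
old_weights = [list(w) for w in clf.weights]

clf.add_task([[-1.0, 0.1]])                 # task 2 introduces class 2
for _ in range(5):
    clf.update([-1.0, -1.0], label=2)       # trains only the new head
    clf.update([5.0, 5.0], label=0)         # ignored: class 0 is frozen
```

Because the old class vectors cannot move, the drift described under the reasons for forgetting cannot occur for them in this sketch. Scoring with cosine similarity also keeps old and new classes on the same scale regardless of the norms of their class vectors.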