Learning without Forgetting for Vision-Language Models

Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
2023-05-31
Abstract: Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture semantic information with stronger representation ability. Extensive experiments on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses Class-Incremental Learning (CIL) with Vision-Language Models (VLMs): the model must absorb new classes as they arrive without forgetting the classes it has already learned. It points out that traditional CIL methods focus mainly on visual information and neglect the potential of textual information for building generalizable feature representations. When a VLM is continually trained on new classes, it often suffers from catastrophic forgetting: learning new concepts overwrites old knowledge and degrades performance on earlier classes.

The paper identifies two main challenges:

1. How to adapt to new tasks without forgetting old knowledge.
2. How to make full use of multimodal (visual and textual) information.

To address them, the authors propose PROjectiOn Fusion (PROOF), which rests on two key techniques:

1. **Task-Specific Projection**: Freeze the pre-trained image and text encoders and add linear projection layers on top of them. For each new task, add new projection layers and freeze the existing ones, so that old knowledge is preserved.
2. **Cross-Modal Fusion**: Adjust the embeddings of the query instance and its contextual information through a self-attention mechanism, so that visual and textual features are refined jointly and the model's predictions improve.

With these techniques, PROOF incorporates new classes into the model while resisting forgetting of old ones, and it achieves state-of-the-art performance on nine benchmark datasets.
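The task-specific projection idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the `ProjectionBank` class, its method names, the random initialization, and the choice to sum the outputs of all projections are assumptions made for illustration. The only point it demonstrates is that adding a projection for a new task while freezing earlier ones leaves the old parameters untouched during training.

```python
import random


class ProjectionBank:
    """Hypothetical container of per-task linear projections over frozen encoder features."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.projections = []  # one dim x dim weight matrix per task
        self.trainable = []    # parallel flags; only the newest projection is trainable

    def add_task(self):
        # Freeze every projection learned so far, then append a new trainable one.
        self.trainable = [False] * len(self.projections)
        weights = [[self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
                   for _ in range(self.dim)]
        self.projections.append(weights)
        self.trainable.append(True)

    def project(self, feature):
        # Assumed aggregation: sum the outputs of all task projections.
        out = [0.0] * self.dim
        for weights in self.projections:
            for i, row in enumerate(weights):
                out[i] += sum(w * x for w, x in zip(row, feature))
        return out

    def sgd_step(self, grads, lr=0.1):
        # grads holds one gradient matrix per projection; frozen projections are skipped.
        for weights, grad, is_trainable in zip(self.projections, grads, self.trainable):
            if not is_trainable:
                continue
            for i in range(self.dim):
                for j in range(self.dim):
                    weights[i][j] -= lr * grad[i][j]


# Usage: learn task 1, then add task 2 and take one (dummy) gradient step.
bank = ProjectionBank(dim=4)
bank.add_task()
old_snapshot = [row[:] for row in bank.projections[0]]
bank.add_task()
new_snapshot = [row[:] for row in bank.projections[1]]
dummy_grad = [[1.0] * 4 for _ in range(4)]
bank.sgd_step([dummy_grad, dummy_grad])
```

After the step, the first projection is identical to its snapshot while the second has moved, which is the mechanism the summary credits with preserving old knowledge. In a real system the frozen encoders would supply `feature`, and the gradients would come from a training loss rather than a dummy matrix.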