Learning without Forgetting for Vision-Language Models

Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu

2023-05-31

Abstract: Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture semantic information with stronger representation ability. Extensive experiments on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

The paper addresses Class-Incremental Learning (CIL) with Vision-Language Models (VLMs): the model must keep learning as new categories arrive without forgetting the categories it has already learned. The paper points out that existing class-incremental learning methods focus mainly on visual information and neglect the potential of textual information for building generalizable feature representations. When a VLM is continually trained on new categories, catastrophic forgetting often occurs: learning new concepts overwrites old knowledge and performance on earlier classes degrades.

The paper identifies two main challenges:

1. How to adapt to new tasks without forgetting old knowledge.
2. How to make full use of multimodal (visual and textual) information.

To address them, the authors propose "Projection Fusion" (PROOF), which rests on two key techniques:

1. **Task-Specific Projection**: Freeze the pre-trained image/text encoders and add linear projection layers on top of them. For each new task, add new projection layers and freeze the old ones, thereby preserving old knowledge.
2. **Cross-Modal Fusion**: Adjust the embeddings of query instances and contextual information through a self-attention mechanism, fusing visual and textual information to improve the model's predictions.

With these techniques, PROOF incorporates new categories into the model while resisting the forgetting of old ones, and the authors report state-of-the-art performance on nine benchmark datasets.
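The task-specific projection idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the class name `ExpandableProjection`, the `sgd_step` helper, and the choice to aggregate by summing the outputs of all projections are assumptions made here for illustration, and the input feature stands in for the output of a frozen image or text encoder.

```python
import random


class ExpandableProjection:
    """Hypothetical sketch: per-task linear projections over a frozen feature."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.projections = []  # one dim x dim weight matrix per task
        self.trainable = []    # parallel flags; only the newest is trainable

    def add_task(self):
        # Freeze every existing projection, then append a fresh trainable one.
        self.trainable = [False] * len(self.projections)
        matrix = [[self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
                  for _ in range(self.dim)]
        self.projections.append(matrix)
        self.trainable.append(True)

    def forward(self, feature):
        # Assumption: outputs of all task projections are aggregated by summation.
        out = [0.0] * self.dim
        for matrix in self.projections:
            for i, row in enumerate(matrix):
                out[i] += sum(w * x for w, x in zip(row, feature))
        return out

    def sgd_step(self, grads, lr=0.1):
        # grads holds one dim x dim gradient per projection; frozen ones are skipped.
        for matrix, grad, flag in zip(self.projections, grads, self.trainable):
            if not flag:
                continue
            for i in range(self.dim):
                for j in range(self.dim):
                    matrix[i][j] -= lr * grad[i][j]


if __name__ == "__main__":
    proj = ExpandableProjection(dim=3)
    proj.add_task()                      # task 1: first projection, trainable
    proj.add_task()                      # task 2: first is frozen, second trains
    ones = [[1.0] * 3 for _ in range(3)]
    proj.sgd_step([ones, ones])          # only the task-2 projection is updated
    print(proj.trainable)
    print(proj.forward([1.0, 0.0, 0.0]))
```

Because frozen projections are never updated, whatever they contributed for earlier tasks stays fixed while the newest projection absorbs the new classes. In a real system the same bookkeeping would typically be delegated to a deep-learning framework's parameter-freezing mechanism rather than written by hand.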

Continual Learning of Image Classes with Language Guidance from a Vision-Language Model

Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning

Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning

Class Incremental Learning with Pre-trained Vision-Language Models

Continual Vision-Language Retrieval Via Dynamic Knowledge Rectification

Visual In-Context Learning for Large Vision-Language Models

Towards Multimodal In-Context Learning for Vision & Language Models

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

Towards Better Vision-Inspired Vision-Language Models

Enhancing Visual Continual Learning with Language-Guided Supervision

VIGC: Visual Instruction Generation and Correction

Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters

VILA: On Pre-training for Visual Language Models

MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Don't Stop Learning: Towards Continual Learning for the CLIP Model