Abstract: This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a stream of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by jointly learning a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.

What problem does this paper attempt to address?

The paper addresses open-domain continual learning (ODCL) for vision-language models (VLMs). In this setting, a model must be continually updated and must perform inference on a stream of datasets drawn from different domains, including unseen domains and novel classes. Traditional continual learning (CL) mostly handles known classes within a single domain. ODCL must additionally cope with large class correlations and domain gaps across datasets, and with the forgetting of zero-shot knowledge that can occur when a large pre-trained VLM is adapted to new data.

To address this, the paper introduces CoLeCLIP, an approach built on CLIP that jointly learns a set of task prompts and a cross-domain class vocabulary. The task prompts capture domain-specific patterns. A parameter-efficient fine-tuning (PEFT) module together with the cross-domain class vocabulary mitigates forgetting of both the zero-shot recognition capability of the pre-trained model and the knowledge adapted from new tasks.

Experiments on 11 domain datasets show that CoLeCLIP outperforms existing methods in both the task-incremental learning (TIL) and class-incremental learning (CIL) settings. Compared with existing continual learning methods, CoLeCLIP is also more lightweight and does not require large-scale external datasets for knowledge distillation, which reduces resource and computation time requirements.

In summary, the main contributions of the paper are:

1. Introducing the problem of open-domain continual learning, which requires recognizing known and novel classes in seen and unseen domains while preserving both the zero-shot knowledge from pre-training and the new knowledge learned from downstream tasks.
2. Proposing the lightweight yet effective CoLeCLIP approach, which addresses the unique challenges of open-domain CL through joint learning of task prompts and class embeddings.
3. Conducting extensive experiments on 11 domain datasets, demonstrating that CoLeCLIP outperforms state-of-the-art methods in both task- and class-incremental learning settings.
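
To make the difference between the TIL and CIL settings concrete, the following is a minimal sketch, not the authors' implementation. It assumes a class vocabulary stored as a plain mapping from class name to embedding and classifies an image embedding by cosine similarity, as in CLIP-style zero-shot recognition. The embeddings, class names, task names, and the `predict` helper are all hypothetical toy values chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(image_emb, vocabulary, candidates=None):
    """Return the class whose embedding is most similar to image_emb.

    candidates=None searches the whole vocabulary (CIL-style inference);
    passing a task's class list restricts the search (TIL-style inference).
    """
    names = list(vocabulary) if candidates is None else list(candidates)
    return max(names, key=lambda name: cosine(image_emb, vocabulary[name]))

# Hypothetical cross-domain vocabulary: class name -> toy 3-d embedding.
vocabulary = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "sedan": [0.9, 0.1, 0.4],   # deliberately correlated with "cat"
    "truck": [0.0, 0.2, 1.0],
}
# Hypothetical grouping of classes by the task (domain) that introduced them.
task_classes = {"pets": ["cat", "dog"], "vehicles": ["sedan", "truck"]}

image_emb = [0.8, 0.3, 0.2]  # toy embedding of an image from the "pets" task

til_pred = predict(image_emb, vocabulary, task_classes["pets"])  # task id known
cil_pred = predict(image_emb, vocabulary)                        # task id unknown
print(til_pred, cil_pred)
```

With these toy values the task-restricted search picks `cat`, while the search over the full vocabulary picks `sedan`, because that embedding was placed close to the image on purpose. This is the kind of cross-domain class correlation that makes the CIL setting harder than TIL.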

CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning

Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

How Much Can CLIP Benefit Vision-and-Language Tasks?

Boosting Open-Domain Continual Learning via Leveraging Intra-domain Category-aware Prototype

Don't Stop Learning: Towards Continual Learning for the CLIP Model

Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Continual Vision-Language Representation Learning with Off-Diagonal Information

Continual Learning in Open-vocabulary Classification with Complementary Memory Systems

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Open-Vocabulary Calibration for Fine-tuned CLIP

LVP-CLIP: Revisiting CLIP for Continual Learning with Label Vector Pool

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

Learning to Prompt for Vision-Language Models

kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

Interactive Continual Learning: Fast and Slow Thinking

Class Incremental Learning with Pre-trained Vision-Language Models

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Delving into the Openness of CLIP