CA-CLIP: category-aware adaptation of CLIP model for few-shot class-incremental learning
Xu, Yuqiao; Huang, Shucheng; Zhou, Haoliang
DOI: https://doi.org/10.1007/s00530-024-01322-y
IF: 3.9
2024-04-24
Multimedia Systems
Abstract: Few-shot class-incremental learning (FSCIL) learns from continuously arriving new categories, each with only a small number of training samples. As a challenging problem, FSCIL aims to mitigate the catastrophic forgetting of old knowledge while preventing overfitting to new categories. Vision-language pre-training (VLP) models achieve effective cross-modal interaction between visual and textual information, providing generalized representations. However, when VLP models such as CLIP are fine-tuned for FSCIL, learning category-specific features from continuously arriving new categories with only few-shot samples remains difficult and is often accompanied by the forgetting of old knowledge. To address this problem, we propose category-aware adaptation of the CLIP model (CA-CLIP) for FSCIL. To retain category-agnostic knowledge while extracting useful category-specific information, we introduce a target-focused adapter (TFA). By leveraging semantic-rich image patch embeddings, the TFA enables the class token to focus more on category-related target features. To obtain these semantic-rich patch embeddings, we propose a patch filtering (PF) module, which selects the patch embeddings that contribute positively to the class token, based on the attention distribution. Moreover, we apply a bi-level optimization strategy based on meta-learning so that the model learns how to balance retaining old knowledge against learning new knowledge. Extensive experiments demonstrate that our approach achieves competitive results compared to state-of-the-art methods on three FSCIL benchmark datasets.
computer science, information systems, theory & methods
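
Illustrative sketch (not from the paper): the abstract says the PF module selects patch embeddings that contribute positively to the class token based on the attention distribution, but it does not give the selection rule. The Python snippet below shows one possible reading, assuming a patch is kept when the class token's attention to it exceeds the mean attention over all patches. The function name `filter_patches`, the mean-threshold rule, and the toy values are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def filter_patches(cls_attention, patch_embeddings):
    """Keep patches whose class-token attention is above the mean.

    ASSUMPTION: the above-mean threshold is a stand-in for the paper's
    (unspecified here) criterion for a "positive contribution".
    cls_attention: shape (N,), attention from the class token to N patches.
    patch_embeddings: shape (N, D), one embedding per patch.
    """
    cls_attention = np.asarray(cls_attention, dtype=float)
    patch_embeddings = np.asarray(patch_embeddings, dtype=float)
    if cls_attention.shape[0] != patch_embeddings.shape[0]:
        raise ValueError("one attention weight is required per patch")
    keep = cls_attention > cls_attention.mean()  # boolean mask over patches
    return patch_embeddings[keep], np.flatnonzero(keep)


# Toy example: 4 patches with 2-dimensional embeddings (made-up values).
attn = np.array([0.05, 0.40, 0.10, 0.45])
patches = np.arange(8, dtype=float).reshape(4, 2)
selected, indices = filter_patches(attn, patches)
```

In this toy case the mean attention is 0.25, so only the second and fourth patches are retained. In a real CLIP vision transformer the attention weights would come from the class-token row of a self-attention map, and a different threshold or a top-k rule could serve equally well under this reading.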