CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection

Mingyi Guo, Yuyang Liu, Zongying Lin, Peixi Peng, Yonghong Tian
2024-10-11
Abstract: Incremental object detection (IOD) is challenged by background shift, where background categories in sequential data may include previously learned or future classes. Vision-language foundation models such as CLIP capture shared attributes from extensive image-text paired data during pre-training. Inspired by this, we propose a novel method that uses the attributes in vision-language foundation models for incremental object detection. Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes. Specifically, we use large language models to generate candidate textual attributes and select the most relevant ones based on the current training data, recording their significance in an attribute assignment matrix. For subsequent tasks, we freeze the retained attributes and continue selecting from the remaining candidates while updating the attribute assignment matrix accordingly. Furthermore, we employ OWL-ViT as our baseline, preserving the original parameters of the pre-trained foundation model. Through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage while significantly enhancing the scalability and adaptability of IOD. Extensive two-phase and multi-phase experiments on the COCO dataset demonstrate the state-of-the-art performance of our proposed method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses background shift in Incremental Object Detection (IOD). When new classes are introduced over time, the contents of the background class change, which degrades the model's recognition of old classes and can lead to misclassification. The challenge matters in practice because new object classes keep emerging and existing models have difficulty adapting to them.

#### Specific manifestations of the background shift problem

1. **Unlabeled background classes**: During incremental learning, objects from previous or future tasks may not be labeled in the current task and are instead treated as background.
2. **Confusion problem**: Because the background class changes, the model may confuse these background objects with newly introduced classes, resulting in performance degradation.
3. **Forgetting problem**: As new classes are added, the model may forget previously learned class features and fail to maintain good recognition of old classes.

#### Solutions

To address these challenges, the authors propose a method named Class-Agnostic Shared Attributes (CASA). It uses vision-language foundation models to strengthen incremental object detection by generating and selecting shared attributes. The specific steps are as follows:

1. **Generate candidate text attributes**: Use a large language model (such as GPT-3.5) to generate a large pool of attributes related to different object classes.
2. **Select relevant attributes**: Select the candidate attributes most relevant to the current training data, and record their importance in an attribute assignment matrix.
3. **Freeze selected attributes**: In subsequent tasks, freeze the attributes already selected, continue selecting from the remaining candidates, and update the attribute assignment matrix.
4. **Parameter-efficient fine-tuning**: Start from a pre-trained foundation model (such as OWL-ViT) and apply parameter-efficient fine-tuning, which increases parameter storage by only 0.7% while improving the scalability and adaptability of IOD.

In this way, CASA alleviates the background shift problem and improves the robustness and accuracy of incremental object detection.

### Summary

The core problem of this paper is background shift in incremental object detection. It proposes the CASA method, which uses vision-language foundation models to generate and select shared attributes, thereby enhancing the model's incremental learning ability and its adaptability to new classes.
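
The attribute selection and freezing steps listed under Solutions can be illustrated with a small sketch. This is not the authors' implementation: the text above does not specify the relevance criterion or how the weights are computed, so cosine similarity, the max-over-classes relevance score, and the names `cosine_similarity` and `learn_task` are all assumptions made for illustration. Random vectors stand in for the text and visual embeddings a vision-language model would provide.

```python
import numpy as np

def cosine_similarity(x, y):
    """Pairwise cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T

def learn_task(class_feats, attr_embs, retained, assignment, top_k):
    """One incremental task (hypothetical helper, not from the paper).

    class_feats: (num_new_classes, d) features for the current task's classes
    attr_embs:   (num_candidates, d) embeddings of candidate text attributes
    retained:    list of attribute indices kept from earlier tasks (frozen)
    assignment:  (num_seen_classes, num_candidates) attribute assignment matrix
    """
    sim = cosine_similarity(class_feats, attr_embs)

    # Assumed relevance score: best similarity to any class in this task.
    relevance = sim.max(axis=0)
    # Frozen attributes are already retained, so exclude them from selection.
    relevance[np.asarray(retained, dtype=int)] = -np.inf
    new = np.argsort(-relevance)[:top_k].tolist()

    # Earlier attributes keep their positions; new ones are appended.
    retained = retained + new

    # New rows record weights only on retained attributes; rows belonging to
    # earlier classes are left untouched.
    rows = np.zeros_like(sim)
    rows[:, retained] = sim[:, retained]
    assignment = np.vstack([assignment, rows])
    return retained, assignment

# Toy data: 20 candidate attributes, 8-dimensional embeddings (placeholders).
rng = np.random.default_rng(0)
attr_embs = rng.normal(size=(20, 8))
retained, assignment = [], np.zeros((0, 20))

# Task 1 introduces 3 classes and selects 4 attributes.
task1 = rng.normal(size=(3, 8))
retained, assignment = learn_task(task1, attr_embs, retained, assignment, top_k=4)
after_task1 = list(retained)
old_rows = assignment.copy()

# Task 2 introduces 2 classes and selects 4 more from the remaining candidates.
task2 = rng.normal(size=(2, 8))
retained, assignment = learn_task(task2, attr_embs, retained, assignment, top_k=4)
```

In this sketch, the attributes chosen in the first task stay in the retained set unchanged, and the matrix rows for earlier classes are never rewritten, which mirrors the freezing idea in step 3. A real system would replace the random vectors with embeddings from the foundation model and would likely learn the matrix entries rather than copy similarities.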