Abstract: Incremental object detection (IOD) is challenged by background shift, where background categories in sequential data may include previously learned or future classes. Vision-language foundation models such as CLIP capture shared attributes from extensive image-text paired data during pre-training. Inspired by this, we propose a novel method that uses the attributes in vision-language foundation models for incremental object detection. Our method constructs a Class-Agnostic Shared Attribute base (CASA) to capture common semantic information among incremental classes. Specifically, we use large language models to generate candidate textual attributes, select the most relevant ones based on the current training data, and record their significance in an attribute assignment matrix. For subsequent tasks, we freeze the retained attributes and continue selecting from the remaining candidates while updating the attribute assignment matrix accordingly. Furthermore, we employ OWL-ViT as our baseline, preserving the original parameters of the pre-trained foundation model. Through parameter-efficient fine-tuning, our method adds only 0.7% to parameter storage while significantly enhancing the scalability and adaptability of IOD. Extensive two-phase and multi-phase experiments on the COCO dataset demonstrate the state-of-the-art performance of our proposed method.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses background shift in Incremental Object Detection (IOD). When new classes are introduced gradually, the set of objects treated as background changes from task to task. This degrades the model's ability to recognize old classes and can lead to misclassification. The challenge matters in practice because new object classes keep emerging over time, and existing models have difficulty adapting to them.

#### Specific manifestations of the background shift problem

1. **Unlabeled background classes**: During incremental learning, objects from previous or future tasks may be unlabeled in the current task and are therefore treated as background.
2. **Confusion problem**: Because the background classes change, the model may confuse these background objects with newly introduced classes, which degrades performance.
3. **Forgetting problem**: As new classes are added, the model may forget previously learned class features and fail to maintain good recognition of old classes.

#### Solutions

To address these challenges, the authors propose a method named Class-Agnostic Shared Attributes (CASA). It uses vision-language foundation models to strengthen incremental object detection by generating and selecting shared attributes. The specific steps are as follows:

1. **Generate candidate text attributes**: Use a large language model (such as GPT-3.5) to generate a large pool of attributes related to different object classes.
2. **Select relevant attributes**: Select the candidate attributes most relevant to the current training data, and record their importance in an attribute assignment matrix.
3. **Freeze selected attributes**: In subsequent tasks, freeze the selected attributes, continue selecting from the remaining candidates, and update the attribute assignment matrix.
4. **Parameter-efficient fine-tuning**: Starting from a pre-trained foundation model (such as OWL-ViT), apply parameter-efficient fine-tuning, which adds only 0.7% to parameter storage while significantly improving the scalability and adaptability of IOD.

In this way, CASA alleviates the background shift problem and improves the robustness and accuracy of incremental object detection.

### Summary

The core problem of this paper is background shift in incremental object detection. It proposes the CASA method, which uses vision-language foundation models to generate and select shared attributes, thereby enhancing the model's incremental learning ability and its adaptability to new classes.
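The select-then-freeze steps listed under Solutions can be illustrated with a small sketch. It is not the authors' implementation. The class `SharedAttributeBase`, the mean-cosine relevance score, the top-`k` selection rule, and the toy 3-dimensional embeddings are all assumptions made for illustration; the paper's actual scoring and selection criteria are not given in this summary. The assignment matrix is modeled as a nested dictionary (class, then attribute, then weight).

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SharedAttributeBase:
    """Toy attribute base (hypothetical): retained attributes stay frozen."""

    def __init__(self, candidates: Dict[str, Vector]) -> None:
        self.candidates = dict(candidates)  # attribute text -> embedding
        self.retained: List[str] = []       # frozen attributes, in selection order
        # Assignment matrix: class name -> {attribute -> weight}.
        self.assignment: Dict[str, Dict[str, float]] = {}

    def _relevance(self, attr: str, feats: List[Vector]) -> float:
        # Assumed score: mean cosine similarity to the class's features.
        emb = self.candidates[attr]
        return sum(cosine(emb, f) for f in feats) / len(feats)

    def learn_task(self, class_feats: Dict[str, List[Vector]], k: int) -> List[str]:
        """Select up to k new attributes for this task and extend the matrix."""
        # Previously retained attributes are frozen: never re-selected or dropped.
        remaining = [a for a in self.candidates if a not in self.retained]
        # Rank remaining candidates by best relevance to any current class.
        ranked = sorted(
            remaining,
            key=lambda a: max(self._relevance(a, f) for f in class_feats.values()),
            reverse=True,
        )
        new_attrs = ranked[:k]
        self.retained.extend(new_attrs)
        # Only rows for the current classes are written; rows of earlier
        # classes are left untouched (missing columns read as zero weight).
        for cls, feats in class_feats.items():
            self.assignment[cls] = {
                a: self._relevance(a, feats) for a in self.retained
            }
        return new_attrs


# Toy embeddings standing in for text and region features (assumed values).
candidates = {
    "has wheels": [1.0, 0.0, 0.0],
    "has fur": [0.0, 1.0, 0.0],
    "has wings": [0.0, 0.0, 1.0],
    "made of metal": [0.9, 0.0, 0.1],
}
base = SharedAttributeBase(candidates)
task1 = base.learn_task({"car": [[1.0, 0.1, 0.0]], "dog": [[0.0, 1.0, 0.1]]}, k=2)
task2 = base.learn_task({"airplane": [[0.5, 0.0, 1.0]]}, k=1)
```

In this toy run, the first task retains "has wheels" and "has fur", and the second task adds "has wings" from the remaining candidates. The new class `airplane` receives weights over all three retained attributes, so it can reuse attributes chosen for earlier classes, while the rows for `car` and `dog` are left as they were.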

CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics.

Incremental Object Detection with CLIP

Incremental Object Detection with Image-level Labels

CIOD: an intelligent class-incremental object detection system with nearest mean of exemplars

Simple Image-level Classification Improves Open-vocabulary Object Detection

Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection

Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection

Context-aware Feature Reconstruction for Class-Incremental Anomaly Detection and Localization

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

MultIOD: Rehearsal-free Multihead Incremental Object Detector

A Class-Incremental Detection Method of Remote Sensing Images Based on Selective Distillation

Incremental Detection of Remote Sensing Objects with Feature Pyramid and Knowledge Distillation

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Bridging Non Co-occurrence with Unlabeled In-the-wild Data for Incremental Object Detection

Enhancing class-incremental object detection in remote sensing through instance-aware distillation

Purified Distillation: Bridging Domain Shift and Category Gap in Incremental Object Detection

Towards Non Co-occurrence Incremental Object Detection with Unlabeled In-the-Wild Data

TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning

RD-IOD: Two-Level Residual-Distillation-Based Triple-Network for Incremental Object Detection

Class Incremental Learning with Pre-trained Vision-Language Models