Abstract: In contrast to the incremental classification task, the incremental detection task is characterized by the presence of data ambiguity, as an image may have differently labeled bounding boxes across multiple continuous learning stages. This phenomenon often impairs the model's ability to effectively learn new classes. However, existing research has paid less attention to the forward compatibility of the model, which limits its suitability for incremental learning. To overcome this obstacle, we propose leveraging a visual-language model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario. Finally, we utilize the CLIP image encoder to accurately identify potential objects. We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance. We evaluate our approach on various incremental learning settings using the PASCAL VOC 2007 dataset, and our approach outperforms state-of-the-art methods, particularly for recognizing the new classes.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two main problems in the incremental detection task: **data ambiguity** and **forward compatibility**.

#### 1. Data Ambiguity

Unlike in incremental classification, an image in the incremental detection task may carry differently annotated bounding boxes across consecutive learning stages, which damages the model's ability to learn new classes. Specifically, an image at a given stage contains not only objects of the current-stage classes but may also contain previously learned classes and potential new classes. Because these objects are unlabeled at that stage, they are wrongly treated as negative samples during training, which leads to compatibility problems. Incremental detection is therefore more challenging than incremental classification, and incremental classification methods are difficult to apply to it directly.

#### 2. Forward Compatibility

Most existing research focuses on improving the model's **backward compatibility**, that is, ensuring that the model retains old knowledge, through techniques such as knowledge distillation and replay sampling. Few studies address forward compatibility in incremental detection, that is, how to train on the current stage's data so that future new classes can be integrated into the current model easily. Forward compatibility matters for incremental detection because the model must maintain good performance when new classes are introduced later.

### Solutions

To solve the above problems, the authors propose IODC (Incremental Object Detection with CLIP), a method that uses a vision-language model such as CLIP for incremental object detection. Specifically:

1. **Generate text feature embeddings**: Use CLIP's text encoder to generate text feature embeddings for the different class sets to enhance the feature space.
2. **Simulate actual incremental scenarios**: Replace the unavailable new classes with super-classes in the early learning stages.
3. **Identify potential objects**: Use CLIP's image encoder to identify potential objects that the model classified as background, and change the background labels of these proposals to known classes to alleviate the data ambiguity problem.
4. **Class mapping**: Complete knowledge transfer through class mapping, enabling the model to better adapt to new classes.

With these components, the method improves the model's incremental learning ability. It performs particularly well when learning new classes and surpasses other state-of-the-art methods.

### Summary

This paper addresses the data ambiguity and forward compatibility problems in the incremental detection task by introducing the CLIP model, significantly improving the model's performance in incremental learning scenarios, especially on new classes.
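The third step of the solution, relabeling background proposals with CLIP, can be illustrated with a small sketch. It is not the authors' implementation: it assumes that proposal embeddings (from CLIP's image encoder) and class embeddings (from CLIP's text encoder) have already been computed, and it stands in for them with toy 2-D vectors. The function name `relabel_background`, the `threshold` and `temperature` values, and the dictionary layout are all hypothetical choices made for illustration.

```python
import math

BACKGROUND = "background"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relabel_background(proposals, text_embeddings, threshold=0.8, temperature=0.01):
    """Turn confident background proposals into pseudo-annotations.

    proposals: list of dicts with 'label' and 'embedding' (assumed image-encoder output).
    text_embeddings: dict mapping class name -> embedding (assumed text-encoder output).
    threshold, temperature: hypothetical values, not taken from the paper.
    """
    names = list(text_embeddings)
    relabeled = []
    for proposal in proposals:
        updated = dict(proposal)  # copy so the input is left unchanged
        if proposal["label"] == BACKGROUND:
            sims = [cosine(proposal["embedding"], text_embeddings[n]) for n in names]
            probs = softmax(sims, temperature)
            best = max(range(len(names)), key=probs.__getitem__)
            # Only relabel when one known class clearly dominates.
            if probs[best] >= threshold:
                updated["label"] = names[best]
        relabeled.append(updated)
    return relabeled


# Toy stand-ins for CLIP embeddings (real ones are high-dimensional).
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
proposals = [
    {"label": BACKGROUND, "embedding": [0.9, 0.1]},  # close to "cat"
    {"label": BACKGROUND, "embedding": [1.0, 1.0]},  # ambiguous, stays background
    {"label": "dog", "embedding": [0.0, 1.0]},       # already labeled, untouched
]
result = relabel_background(proposals, text_embeddings)
labels = [p["label"] for p in result]
print(labels)
```

In this sketch only proposals currently labeled as background are reconsidered, and an ambiguous proposal keeps its background label. This reflects the goal stated above: recover unlabeled objects as pseudo-annotations without overwriting existing labels.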

Incremental Object Detection with CLIP
