Abstract: In contrast to the incremental classification task, the incremental detection task is characterized by the presence of data ambiguity, as an image may have differently labeled bounding boxes across multiple continuous learning stages. This phenomenon often impairs the model's ability to effectively learn new classes. However, existing research has paid less attention to the forward compatibility of the model, which limits its suitability for incremental learning. To overcome this obstacle, we propose leveraging a visual-language model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario. Finally, we utilize the CLIP image encoder to accurately identify potential objects. We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance. We evaluate our approach on various incremental learning settings using the PASCAL VOC 2007 dataset, and our approach outperforms state-of-the-art methods, particularly for recognizing the new classes.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper aims to solve two main problems in the incremental detection task: **Data Ambiguity** and **Forward Compatibility**.
#### 1. Data Ambiguity
Unlike incremental classification, incremental detection allows the same image to carry different annotated bounding boxes across consecutive learning stages, which impairs the model's ability to learn new classes. Specifically, an image at a given stage contains not only objects of the current-stage classes but may also contain previously learned classes and potential future classes. Because only the current-stage classes are annotated, these old and unseen objects are wrongly treated as negative (background) samples during training, which leads to compatibility problems. Incremental detection is therefore more challenging than incremental classification, and incremental classification methods are difficult to apply to it directly.
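The effect can be illustrated with a small sketch. It is not the paper's training code: the box coordinates, the class names, the `assign_labels` helper, and the IoU threshold of 0.5 are all assumptions chosen for illustration. It shows how proposals that tightly cover an old-class object and a future-class object end up labeled as background when only the current-stage class is annotated.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(proposals, annotations, iou_thresh=0.5):
    """Give each proposal the class of its best-overlapping annotation,
    or 'background' if no annotation overlaps it enough (hypothetical helper)."""
    labels = []
    for p in proposals:
        best_cls, best_iou = "background", 0.0
        for box, cls in annotations:
            overlap = iou(p, box)
            if overlap >= iou_thresh and overlap > best_iou:
                best_cls, best_iou = cls, overlap
        labels.append(best_cls)
    return labels


# Hypothetical image: a dog (old class), a cat (current class), a horse (future class).
all_objects = [
    ((10, 10, 50, 50), "dog"),
    ((60, 10, 100, 50), "cat"),
    ((10, 60, 50, 100), "horse"),
]
current_stage_classes = {"cat"}

# Only current-stage classes are annotated at this stage.
stage_annotations = [(b, c) for b, c in all_objects if c in current_stage_classes]

# One proposal per object, each overlapping its object closely.
proposals = [(12, 11, 49, 52), (61, 9, 99, 51), (11, 62, 50, 98)]

stage_labels = assign_labels(proposals, stage_annotations)
full_labels = assign_labels(proposals, all_objects)
print(stage_labels)  # dog and horse proposals become 'background'
print(full_labels)   # what the labels would be with complete annotations
```

In this toy setup the dog and horse proposals are valid detections, yet the stage-level annotations turn them into negative samples, which is the ambiguity described above.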
#### 2. Forward Compatibility
Most existing research focuses on improving the model's **Backward Compatibility**, that is, ensuring that the model retains old knowledge, through techniques such as knowledge distillation and replay sampling. Few studies have addressed forward compatibility in incremental detection, that is, how to prepare the current model, using only the data of the current stage, so that future classes can be integrated easily. Forward compatibility matters particularly for incremental detection because the model must maintain good performance when new classes are introduced later.
### Solutions
To address these problems, the paper proposes IODC (Incremental Object Detection with CLIP), a method that uses a vision-language model such as CLIP for incremental object detection. Specifically:
1. **Generate text feature embeddings**: Use CLIP's text encoder to generate text feature embeddings for different classes to enhance the feature space.
2. **Simulate actual incremental scenarios**: Replace the unavailable new classes with super-classes (broad categories) in the early learning stages.
3. **Identify potential objects**: Use CLIP's image encoder to identify potential objects that the model has classified as background, and relabel these proposals from background to known classes, using them as pseudo-annotations to alleviate the data ambiguity problem.
4. **Class mapping**: Complete knowledge transfer through class mapping, enabling the model to better adapt to new classes.
With these components, the method improves the model's incremental learning ability and performs especially well when learning new classes, surpassing other state-of-the-art methods.
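Steps 1 and 3 can be sketched as a zero-shot relabeling pass over background proposals. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy vectors stand in for embeddings that would come from CLIP's text and image encoders, and the `relabel_background` helper, the similarity scale of 100, and the confidence threshold of 0.9 are hypothetical choices.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relabel_background(labels, image_embs, text_embs, class_names,
                       scale=100.0, conf_thresh=0.9):
    """Turn confident background proposals into pseudo-annotations.

    labels      : detector labels per proposal ('background' or a class name)
    image_embs  : one embedding per proposal crop (assumed: from an image encoder)
    text_embs   : one embedding per known class (assumed: from a text encoder)
    scale, conf_thresh : assumed values, not taken from the paper
    """
    out = []
    for label, emb in zip(labels, image_embs):
        if label != "background":
            out.append(label)  # keep existing foreground labels untouched
            continue
        probs = softmax([scale * cosine(emb, t) for t in text_embs])
        best = max(range(len(probs)), key=probs.__getitem__)
        # Relabel only when one class clearly dominates.
        out.append(class_names[best] if probs[best] >= conf_thresh else "background")
    return out


# Toy stand-ins for encoder outputs.
class_names = ["dog", "cat"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = ["background", "cat", "background"]
image_embs = [
    [0.9, 0.1, 0.0],  # close to the 'dog' text embedding
    [0.0, 1.0, 0.0],  # already labeled 'cat' by the detector
    [0.5, 0.5, 0.7],  # equally similar to both classes
]

pseudo_labels = relabel_background(labels, image_embs, text_embs, class_names)
print(pseudo_labels)
```

The threshold keeps ambiguous proposals as background, so only confidently recognized boxes would enter training as pseudo-annotations.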
### Summary
The paper addresses data ambiguity and forward compatibility in incremental detection by introducing CLIP. On incremental learning settings built from the PASCAL VOC 2007 dataset, the approach outperforms state-of-the-art methods, particularly in recognizing new classes.