Abstract: In the field of Class Incremental Object Detection (CIOD), creating models that can continuously learn like humans is a major challenge. Pseudo-labeling methods, although initially powerful, struggle with multi-scenario incremental learning due to their tendency to forget past knowledge. To overcome this, we introduce a new approach called Vision-Language Model assisted Pseudo-Labeling (VLM-PL). This technique uses a Vision-Language Model (VLM) to verify the correctness of pseudo ground-truths (GTs) without requiring additional model training. VLM-PL starts by deriving pseudo GTs from a pre-trained detector. Then, we generate custom queries for each pseudo GT using carefully designed prompt templates that combine image and text features. This allows the VLM to classify the correctness of each pseudo GT through its responses. Furthermore, VLM-PL integrates refined pseudo GTs and real GTs from upcoming training, effectively combining new and old knowledge. Extensive experiments conducted on the Pascal VOC and MS COCO datasets not only highlight VLM-PL's exceptional performance in multi-scenario settings but also illuminate its effectiveness in dual-scenario settings, achieving state-of-the-art results in both.
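The pipeline in the abstract (derive pseudo GTs, query a VLM about each one, keep the verified ones, merge them with the real GTs of the new task) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Box` type, the prompt wording, `toy_vlm`, and the yes/no parsing are hypothetical stand-ins, not the paper's templates or model. In practice the `ask_vlm` callable would wrap an actual vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Coords = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Box:
    """One annotation: a class label, a bounding box, and a detector score."""
    label: str
    xyxy: Coords
    score: float = 1.0


# Hypothetical prompt template; the paper's actual templates are not shown here.
PROMPT = "Is the object in this region a {label}? Answer yes or no."


def build_query(box: Box) -> str:
    """Fill the template with the pseudo GT's class label."""
    return PROMPT.format(label=box.label)


def refine_pseudo_gts(image, pseudo_gts: List[Box],
                      ask_vlm: Callable[[object, Box, str], str]) -> List[Box]:
    """Keep only the pseudo GTs that the VLM confirms with a 'yes' answer."""
    kept = []
    for box in pseudo_gts:
        answer = ask_vlm(image, box, build_query(box))
        if answer.strip().lower().startswith("yes"):
            kept.append(box)
    return kept


def merge_annotations(refined: List[Box], real_gts: List[Box]) -> List[Box]:
    """Combine verified old-class pseudo GTs with the new task's real GTs."""
    return list(refined) + list(real_gts)


def toy_vlm(image: Dict[Coords, str], box: Box, query: str) -> str:
    """Stand-in for a real VLM: the 'image' is a dict mapping regions to true labels."""
    return "Yes" if image.get(box.xyxy) == box.label else "No"


if __name__ == "__main__":
    image = {(0, 0, 10, 10): "dog", (5, 5, 20, 20): "cat"}
    pseudo = [Box("dog", (0, 0, 10, 10), 0.9), Box("car", (5, 5, 20, 20), 0.6)]
    new_gts = [Box("bird", (1, 1, 2, 2))]
    training_targets = merge_annotations(
        refine_pseudo_gts(image, pseudo, toy_vlm), new_gts)
    print([b.label for b in training_targets])
```

Because the verifier is passed in as a callable, the filtering step itself needs no training, which mirrors the abstract's claim that verification requires no additional model training.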

What problem does this paper attempt to address?

This paper addresses catastrophic forgetting in Class Incremental Object Detection (CIOD): how to preserve recognition of already learned classes while new classes are continuously introduced. Traditional pseudo-labeling methods perform well initially but tend to forget past knowledge in multi-scenario incremental learning, which degrades performance. The paper therefore proposes Vision-Language Model assisted Pseudo-Labeling (VLM-PL), which uses a Vision-Language Model (VLM) to verify the correctness of pseudo-labels and thereby improves performance in multi-scenario incremental learning.

### Main Contributions of the Paper:

1. **First Application of VLM in CIOD**: The paper integrates a VLM into Class Incremental Object Detection for the first time, addressing challenges that were previously insufficiently resolved in this field.
2. **Effective Prompt-Tuning and Input-Output Process**: The paper proposes a method that handles scenarios where classes are added over multiple incremental steps, using prompt-tuning and a specific input-output process, and that overcomes the performance degradation usual in such cases.
3. **Outstanding Experimental Results**: Extensive experiments show that the method not only excels in multi-scenario incremental learning but also reaches a new state of the art in dual-scenario incremental learning, demonstrating the potential of VLM assistance in object detection.

### Specific Problems Addressed:

- **Accuracy of Pseudo-Labels**: Traditional pseudo-labeling methods rely on the performance of previously trained models. As the number of tasks increases, the model's knowledge of objects learned early gradually blurs, and the accuracy of its pseudo-labels declines. VLM-PL improves the consistency and reliability of pseudo-labels by verifying their correctness with a VLM.
- **Multi-Scenario Incremental Learning**: In multi-scenario incremental learning, models tend to forget past knowledge, resulting in significant performance degradation. VLM-PL reduces error accumulation by combining new and old knowledge, improving performance in complex scenarios.

### Experimental Results:

- **Multi-Scenario Setting**: In the 4-task setting on the PASCAL VOC dataset, VLM-PL achieved an accuracy of 65.5%, which is 6.78% higher than the previous state-of-the-art method DMD+IFD [35].
- **Dual-Scenario Setting**: In the dual-scenario setting on the PASCAL VOC and COCO datasets, VLM-PL also achieved significant performance improvements, especially in single incremental tasks, outperforming existing state-of-the-art methods.

In summary, the paper tackles catastrophic forgetting in Class Incremental Object Detection by introducing VLM-PL, which significantly improves performance in multi-scenario incremental learning.

VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Open-Vocabulary Object Detection using Pseudo Caption Labels

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data

Continual Learning of Image Classes with Language Guidance from a Vision-Language Model

P$^3$OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition

Towards Multimodal In-Context Learning for Vision & Language Models

Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Incremental Object Detection with CLIP

Retrieval-Augmented Open-Vocabulary Object Detection

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Active Prompt Learning in Vision Language Models