DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yunhai Tong, Chen Change Loy

2024-04-02

Abstract: Open-vocabulary object detection (OVOD) aims to detect objects beyond the set of classes observed during training. This work introduces a straightforward and efficient strategy that uses pre-trained vision-language models (VLMs), such as CLIP, to identify potential novel classes through zero-shot classification. Previous methods use a class-agnostic region proposal network to generate object proposals and treat every proposal that does not match the ground truth as background. Unlike these methods, our method selects a subset of the proposals that would otherwise be treated as background and treats them as novel classes during training. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training. Compared to previous pseudo-labeling methods, our approach does not require re-training or offline label processing, which makes it more efficient and effective in one-shot training. Empirical evaluations on three datasets, LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance without incurring additional parameters or computational costs during inference. We also apply our method to various baselines. In particular, compared with the previous method, F-VLM, our method achieves a 1.7% improvement on the LVIS dataset. Combined with the recent method CLIPSelf, our method also achieves 46.7 novel class AP on COCO without introducing extra data for pre-training. We also achieve over 6.5% improvement over the F-VLM baseline on the recent, challenging V3Det dataset. We release our code and models at this https URL.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily aims to address a key issue in Open Vocabulary Object Detection (OVOD): how to better recognize and utilize unseen new categories during the training process. Traditional object detection methods typically can only detect predefined categories encountered during training, whereas in the real world, we need detectors capable of recognizing any object within a large vocabulary range.

Specifically, the paper proposes a Dynamic Self-Training Strategy (DST-Det) to solve this problem through the following approaches:

1. **Utilizing Pre-trained Vision-Language Models (VLM)**:
   - Using pre-trained vision-language models (such as CLIP) to generate pseudo-labels for potential new categories, thereby recognizing these categories in zero-shot classification.
2. **Improving the Training Process**:
   - Unlike previous methods that treat new categories as background, this method selects a portion of proposals as pseudo-labels for new categories during the training phase, thereby improving the recall and accuracy of new categories.
3. **Efficiency**:
   - Does not require additional datasets or retraining processes, making this method more efficient and easy to implement.

### Main Contributions

- Proposes a new dynamic self-training framework (DST-Det) that can directly utilize large vocabulary information for object detection during the training process.
- Validated on multiple benchmark datasets (including LVIS, V3Det, and COCO), significantly improving the detection performance of new categories without adding extra parameters or computational costs.
- Demonstrated the effectiveness of the method through detailed ablation experiments and visual analysis.
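
### Illustrative Sketch of the Pseudo-Labeling Step

The pseudo-labeling idea summarized above (take proposals that do not match any ground-truth box, score them against novel class names with a CLIP-style zero-shot classifier, and promote the confident ones from background to novel-class labels) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the thresholds `score_thr` and `iou_thr`, the softmax `temperature`, and the toy 2-D embeddings are all hypothetical and stand in for real CLIP region and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_proposals(proposals, region_embs, gt_boxes, text_embs,
                           novel_ids, score_thr=0.8, iou_thr=0.5,
                           temperature=0.01):
    """Return (proposal_index, class_id, score) for confident novel proposals.

    All thresholds and the temperature are assumed values for illustration.
    """
    labels = []
    for idx, (box, emb) in enumerate(zip(proposals, region_embs)):
        # Proposals matched to a base-class ground-truth box keep their label.
        if any(iou(box, gt) >= iou_thr for gt in gt_boxes):
            continue
        # Zero-shot classification over all class-name embeddings.
        probs = softmax([cosine(emb, t) / temperature for t in text_embs])
        best = max(novel_ids, key=lambda c: probs[c])
        # Only confident novel predictions leave the background set.
        if probs[best] >= score_thr:
            labels.append((idx, best, probs[best]))
    return labels


# Toy example: class 0 is a base class, classes 1 and 2 are novel.
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gt_boxes = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
region_embs = [[1.0, 0.0], [0.05, 1.0], [1.0, 1.0]]
labels = pseudo_label_proposals(proposals, region_embs, gt_boxes,
                                text_embs, novel_ids=[1, 2])
print(labels)
```

In the toy data, the first proposal coincides with a ground-truth box and is skipped, the second is confidently assigned to novel class 1, and the third is ambiguous between a base and a novel class, so it remains background. Because the selection runs inside the training loop, a sketch like this needs no offline labeling pass or second training round, which is the efficiency point the summary makes.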

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

P$^3$OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

Learning Object-Language Alignments for Open-Vocabulary Object Detection

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Simple Image-level Classification Improves Open-vocabulary Object Detection

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Open-Vocabulary Object Detection using Pseudo Caption Labels

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

OSR-ViT: A Simple and Modular Framework for Open-Set Object Detection and Discovery