What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Jincheng Li, Chunyu Xie, Xiaoyu Wu, Bin Wang, Dawei Leng

2023-09-01

Abstract: Open-vocabulary detection (OVD) is a new object detection paradigm that aims to localize and recognize unseen objects defined by an unbounded vocabulary. This is challenging because traditional detectors learn only from pre-defined categories and thus fail to detect and localize objects outside the pre-defined vocabulary. To handle the challenge, OVD leverages pre-trained cross-modal vision-language models (VLMs), such as CLIP and ALIGN. Previous works mainly focus on the open-vocabulary classification part, with less attention to the localization part. We argue that for a good OVD detector, classification and localization should be studied in parallel for the novel object categories. We show in this work that improved localization and improved cross-modal classification complement each other and jointly compose a good OVD detector. We analyze three families of OVD methods with different design emphases. We first propose a vanilla method, i.e., cropping a bounding box obtained by a localizer and resizing it before feeding it into CLIP. We next introduce another approach, which combines a standard two-stage object detector with CLIP. A two-stage object detector includes a visual backbone, a region proposal network (RPN), and a region of interest (RoI) head. We decouple the RPN and RoI head (DRR) and use RoIAlign to extract meaningful features, which avoids resizing objects. To further reduce training time and the number of model parameters, we couple the RPN and RoI head (CRR) as the third approach. We conduct extensive experiments on these three types of approaches in different settings. On the OVD-COCO benchmark, DRR obtains the best performance and achieves 35.8 Novel AP$_{50}$, an absolute gain of 2.8 over the previous state-of-the-art (SOTA). On OVD-LVIS, DRR surpasses the previous SOTA by 1.9 AP$_{50}$ in rare categories. We also provide an object detection dataset called PID, together with a baseline on it.
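The abstract notes that DRR uses RoIAlign so that region features can be extracted without resizing the object crop. The sketch below is a minimal, pure-Python illustration of the general RoIAlign idea (bilinear sampling inside each output bin, then averaging); it is not the authors' implementation. The function names, the single-channel feature map, and the convention that integer indices are the sample locations are all assumptions made for illustration.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of lists) at a float location."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point to the valid range of the map.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size=2, samples=2):
    """Pool box=(x0, y0, x1, y1), given in feature-map coordinates, to out_size x out_size.

    Each output bin averages samples x samples bilinear samples, so the box
    never has to be snapped to integer cells or resized as an image crop.
    """
    x0, y0, x1, y1 = box
    bin_h = (y1 - y0) / out_size
    bin_w = (x1 - x0) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y0 + (i + (sy + 0.5) / samples) * bin_h
                    x = x0 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map whose value equals the column index.
feature = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(feature, (0.5, 0.5, 2.5, 2.5))
print(pooled)  # [[1.0, 2.0], [1.0, 2.0]]
```

Because the toy feature map is a linear ramp, each pooled value equals the horizontal centre of its bin, which makes the sub-cell sampling easy to check by hand. Library implementations typically handle multiple channels, batches, and a spatial scale between image and feature coordinates, none of which is modelled here.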

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address two core challenges in Open-Vocabulary Detection (OVD): classification and localization. Specifically:

1. **Classification**: The paper investigates how to leverage large-scale vision-language models (e.g., CLIP) for zero-shot generalization, so that object categories unseen during training can be recognized at inference.
2. **Localization**: It explores how to improve the Region Proposal Network (RPN) so that it localizes objects from unseen categories effectively.

To tackle these challenges, the paper analyzes three methods:

- **Vanilla Method**: Completely decouples the localization and classification components by cropping and resizing objects before feeding them into CLIP for classification.
- **Decoupling RPN and RoI head (DRR)**: Combines a standard two-stage object detector with CLIP, but decouples the RPN and RoI head and uses RoIAlign to extract region features, which avoids resizing objects.
- **Coupling RPN and RoI head (CRR)**: Re-couples the RPN and RoI head to reduce training time and the number of model parameters.

Through these three methods, the paper explores which components further improve open-vocabulary detection and validates them experimentally on multiple benchmarks. DRR achieves the best results on OVD-COCO (35.8 Novel AP$_{50}$) and improves on the previous state of the art on OVD-LVIS rare categories, which points to the potential of decoupling the RPN and RoI head in open-vocabulary detection.
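The vanilla method described above (crop a proposed box, resize it, then classify it against text embeddings) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's code: `toy_encoder` is a hypothetical stand-in for a CLIP image encoder, the text embeddings are hand-written two-dimensional vectors rather than real CLIP outputs, and images are plain nested lists of grayscale values.

```python
import math


def crop_and_resize(image, box, size):
    """Crop box=(x0, y0, x1, y1) from a 2D image and resize it to size x size
    using nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    crop_h, crop_w = y1 - y0, x1 - x0
    patch = []
    for i in range(size):
        src_y = y0 + min(crop_h - 1, i * crop_h // size)
        row = []
        for j in range(size):
            src_x = x0 + min(crop_w - 1, j * crop_w // size)
            row.append(image[src_y][src_x])
        patch.append(row)
    return patch


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_boxes(image, boxes, encode_image, text_embeddings, size=4):
    """For each box: crop, resize, embed, and pick the most similar class name."""
    results = []
    for box in boxes:
        patch = crop_and_resize(image, box, size)
        embedding = encode_image(patch)
        scores = {name: cosine(embedding, vec) for name, vec in text_embeddings.items()}
        label = max(scores, key=scores.get)
        results.append((box, label, scores[label]))
    return results


def toy_encoder(patch):
    """Hypothetical stand-in for a CLIP image encoder: [mean, 1 - mean] brightness."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    return [mean, 1.0 - mean]


# 8x8 dark image with a bright 4x4 square in the middle.
image = [[0.0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        image[y][x] = 1.0

# Hand-written vectors standing in for text embeddings of the class names.
text_embeddings = {"bright object": [1.0, 0.0], "dark object": [0.0, 1.0]}
detections = classify_boxes(image, [(2, 2, 6, 6), (0, 0, 2, 2)], toy_encoder, text_embeddings)
for box, label, score in detections:
    print(box, label, round(score, 3))
```

The point of the sketch is the structure: the localizer (here just a fixed list of boxes) and the classifier share nothing, which is what makes the vanilla method fully decoupled, and the vocabulary can be changed at inference time simply by editing `text_embeddings`.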

