Abstract: Open-vocabulary detection (OVD) is a new object detection paradigm that aims to localize and recognize unseen objects defined by an unbounded vocabulary. This is challenging because traditional detectors learn only from pre-defined categories and thus fail to localize and recognize objects outside that vocabulary. To address this challenge, OVD leverages pre-trained cross-modal vision-language models (VLMs) such as CLIP and ALIGN. Previous works mainly focus on open-vocabulary classification and pay less attention to localization. We argue that a good OVD detector requires studying both classification and localization for novel object categories in parallel. We show in this work that improved localization and cross-modal classification complement each other and jointly make up a good OVD detector. We analyze three families of OVD methods with different design emphases. We first propose a vanilla method, i.e., cropping the image region inside a bounding box obtained by a localizer, resizing it, and feeding it into CLIP. We next introduce a second approach, which combines a standard two-stage object detector with CLIP. A two-stage object detector consists of a visual backbone, a region proposal network (RPN), and a region of interest (RoI) head. We decouple the RPN and the RoI head (DRR) and use RoIAlign to extract meaningful features, which avoids resizing objects. To further reduce training time and the number of model parameters, we couple the RPN and the RoI head (CRR) as the third approach. We conduct extensive experiments on these three types of approaches in different settings. On the OVD-COCO benchmark, DRR obtains the best performance, achieving 35.8 Novel AP$_{50}$, an absolute gain of 2.8 over the previous state of the art (SOTA). On OVD-LVIS, DRR surpasses the previous SOTA by 1.9 AP$_{50}$ on rare categories. We also provide an object detection dataset called PID, together with a baseline on it.
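The vanilla method in the abstract (crop a localized box, resize it, classify it against text embeddings) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not the paper's implementation: `toy_image_encoder` is a hypothetical stand-in for a CLIP image encoder (it just averages RGB values), the text embeddings are hand-picked vectors rather than CLIP text features, and the temperature value is arbitrary. Only the overall flow, crop, resize, embed, then softmax over cosine similarities, follows the description.

```python
import math

def crop(image, box):
    """Crop a row-major image (list of rows of RGB tuples) to box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize to the encoder's fixed input size."""
    in_h, in_w = len(image), len(image[0])
    return [[image[(i * in_h) // out_h][(j * in_w) // out_w]
             for j in range(out_w)] for i in range(out_h)]

def toy_image_encoder(patch):
    """Hypothetical stand-in for an image encoder: mean RGB value of the patch."""
    n = len(patch) * len(patch[0])
    return [sum(px[c] for row in patch for px in row) / n for c in range(3)]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities to each class embedding."""
    r = l2_normalize(region_embedding)
    logits = {name: sum(a * b for a, b in zip(r, l2_normalize(t))) / temperature
              for name, t in text_embeddings.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - m) for name, v in logits.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}

# Toy 8x8 image: blue background with a red square in the middle.
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED if 2 <= x < 6 and 2 <= y < 6 else BLUE for x in range(8)]
         for y in range(8)]

box = (2, 2, 6, 6)  # assumed output of a class-agnostic localizer
resized = resize_nearest(crop(image, box), 8, 8)
# Hand-picked class vectors standing in for text-encoder outputs.
text_embeddings = {"red object": [1.0, 0.0, 0.0], "blue object": [0.0, 0.0, 1.0]}
probs = classify_region(toy_image_encoder(resized), text_embeddings)
best = max(probs, key=probs.get)
```

Because the class set is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is the property the open-vocabulary setting relies on. The resize step is also where this approach distorts objects, the issue the RoIAlign-based variants are described as avoiding.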

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation

Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Learning Object-Language Alignments for Open-Vocabulary Object Detection

SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

Universal Object Detection with Large Vision Model

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Few-Shot Object Detection by Knowledge Distillation Using Bag-of-Visual-Words Representations

Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection

Open-Vocabulary Object Detection using Pseudo Caption Labels