LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Chau Pham, Truong Vu, Khoi Nguyen

2024-06-02

Abstract: This paper addresses the challenging problem of open-vocabulary object detection (OVOD), where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses open-vocabulary object detection (OVOD): a detector must recognize both seen (base) classes and unseen (novel) classes in test images, with no annotated examples of the novel classes available during training. A typical approach uses a joint text-image embedding model such as CLIP to assign each box proposal to its closest text label. The weakness of this approach is that many low-quality boxes, such as those that over- or under-cover the object, receive the same similarity score as high-quality boxes, because CLIP is not trained on exact object location information, which leads to high false-positive and false-negative rates. To address this, the authors propose LP-OVOD, which discards low-quality boxes by linear probing: for each novel class, it retrieves the region proposals most relevant to the novel text as pseudo labels, then trains a sigmoid linear classifier on those pseudo labels using discriminative features from the penultimate layer of a pre-trained Faster R-CNN. The classifier uses sigmoid rather than softmax outputs so that each category is scored independently, giving a single classifier that covers both base and novel categories. On COCO, LP-OVOD reaches 40.5 $\text{AP}_{novel}$ with a ResNet50 backbone, outperforming the prior state of the art without external datasets and without knowing the novel classes during training.
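The two steps described above (retrieving pseudo labels by text similarity, then fitting a sigmoid linear classifier on detector features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the data are synthetic, and the function names, the value of `k`, the learning rate, and the choice to treat every non-retrieved proposal as a negative are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_pseudo_labels(region_embeds, text_embed, k):
    # Indices of the k proposals most similar (cosine) to the novel-class text.
    sims = l2_normalize(region_embeds) @ l2_normalize(text_embed)
    return np.argsort(-sims)[:k]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_probe(features, positive_idx, steps=500, lr=0.1):
    # One-vs-rest logistic regression trained by gradient descent.
    # Assumption: retrieved proposals are positives, all others negatives.
    n, d = features.shape
    y = np.zeros(n)
    y[positive_idx] = 1.0
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        grad = sigmoid(features @ w + b) - y  # d(BCE)/d(logit)
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# Synthetic stand-ins (hypothetical): 200 proposals, the first 20 cover a novel object.
rng = np.random.default_rng(0)
text_embed = l2_normalize(rng.normal(size=32))       # stands in for a CLIP text embedding
region_embeds = rng.normal(size=(200, 32))           # stands in for CLIP region embeddings
region_embeds[:20] = text_embed + 0.05 * rng.normal(size=(20, 32))
features = rng.normal(size=(200, 8))                 # stands in for detector features
features[:20] += 3.0

top_idx = retrieve_pseudo_labels(region_embeds, text_embed, k=10)
w, b = train_sigmoid_probe(features, top_idx)
scores = sigmoid(features @ w + b)                   # independent per-class score
print("mean score, object proposals:", scores[:20].mean())
print("mean score, background proposals:", scores[20:].mean())
```

Because the probe outputs a sigmoid score per class rather than a softmax over classes, one such weight vector per novel class could in principle be stacked next to the base-class weights without renormalizing the existing scores, which is the property the summary attributes to the unified classifier.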
