LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Chau Pham, Truong Vu, Khoi Nguyen
2024-06-02
Abstract: This paper addresses the challenging problem of open-vocabulary object detection (OVOD) where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses open-vocabulary object detection (OVOD): a detector must recognize both base (seen) classes and novel (unseen) classes in test images, without any annotated examples of the novel classes during training. A typical approach uses the joint text-image embeddings of a model such as CLIP to assign each box proposal to its closest text label. This approach has a key weakness: many low-quality boxes (those that over- or under-cover the object) receive the same similarity score as high-quality boxes, because CLIP is not trained on exact object location information. As a result, detection suffers from both false positives and false negatives. To address this, the authors propose LP-OVOD. For each novel class, the method retrieves the region proposals most relevant to the class text and uses them as pseudo-labels. It then trains a sigmoid linear classifier on those pseudo-labels, using discriminative features taken from the penultimate layer of a pre-trained Faster R-CNN, so that low-quality boxes can be discarded. A sigmoid classifier is used instead of a softmax classifier so that the score for each category is predicted independently, which yields one unified classifier for both base and novel categories. On COCO, LP-OVOD reaches 40.5 $\text{AP}_{novel}$ with a ResNet50 backbone, outperforming the prior state of the art without relying on external datasets or knowing the novel categories during training.
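The retrieve-then-probe idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the synthetic features, the value of `top_k`, the plain gradient-descent training loop, and the choice to treat every non-retrieved proposal as a negative are all assumptions made for illustration. In a real pipeline, the region embeddings would come from CLIP and the probe features from the detector's penultimate layer.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def retrieve_pseudo_positives(clip_region_emb, text_emb, top_k):
    """Indices of the top_k proposals with highest cosine similarity to the text."""
    sims = l2_normalize(clip_region_emb) @ l2_normalize(text_emb)
    return np.argsort(-sims)[:top_k]

def train_sigmoid_probe(features, positive_idx, lr=0.1, steps=500):
    """Fit a linear classifier with a sigmoid output by gradient descent on
    binary cross-entropy. Assumption: all non-retrieved proposals are negatives."""
    n, d = features.shape
    y = np.zeros(n)
    y[positive_idx] = 1.0
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = sigmoid(features @ w + b) - y   # gradient of BCE w.r.t. the logits
        w -= lr * (features.T @ err) / n
        b -= lr * err.mean()
    return w, b

# Synthetic stand-ins (hypothetical data, not real CLIP or Faster R-CNN outputs).
rng = np.random.default_rng(0)
n_props, n_good, clip_dim, det_dim = 200, 20, 8, 16
text_emb = np.zeros(clip_dim)
text_emb[0] = 1.0
clip_region_emb = rng.normal(size=(n_props, clip_dim))
clip_region_emb[:n_good] = text_emb + 0.05 * rng.normal(size=(n_good, clip_dim))
det_features = rng.normal(size=(n_props, det_dim))
det_features[:n_good] += 2.0  # well-localized boxes get a distinct detector feature

# Step 1: pseudo-labels for the novel class from text-to-region similarity.
pos_idx = retrieve_pseudo_positives(clip_region_emb, text_emb, top_k=n_good)
# Step 2: linear probe on detector features using those pseudo-labels.
w_novel, b_novel = train_sigmoid_probe(det_features, pos_idx)

# Sigmoid scores are per-class and independent, so a novel-class row can be
# appended to existing base-class weights (random placeholders here).
base_W, base_b = rng.normal(size=(3, det_dim)), np.zeros(3)
unified_W = np.vstack([base_W, w_novel])
unified_b = np.append(base_b, b_novel)
scores = sigmoid(det_features @ unified_W.T + unified_b)  # shape (n_props, 4)
```

Because each column of `scores` is computed independently, adding the novel class leaves the base-class scores unchanged. A softmax over all classes would instead renormalize every score whenever a class is added.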