Abstract: Learning from pseudo-labels generated by vision-language models (VLMs) has been shown in recent studies to be a promising way to assist open-vocabulary detection (OVD). However, due to the domain gap between VLMs and vision-detection tasks, pseudo-labels produced by VLMs are prone to be noisy, and the training design of the detector further amplifies the bias. In this work, we investigate the root cause of VLMs' biased predictions in the OVD context. Our observations lead to a simple yet effective paradigm, named MarvelOVD, that generates significantly better training targets and optimizes the learning procedure in an online manner by marrying the capability of the detector with the vision-language model. Our key insight is that the detector itself can act as strong auxiliary guidance to compensate for the VLM's inability to understand both the "background" and the context of a proposal within the image. Building on this insight, we greatly purify the noisy pseudo-labels via Online Mining and propose Adaptive Reweighting to effectively suppress biased training boxes that are not well aligned with the target object. In addition, we identify a neglected "base-novel-conflict" problem and introduce stratified label assignments to prevent it. Extensive experiments on the COCO and LVIS datasets demonstrate that our method outperforms other state-of-the-art methods by significant margins. Code is available at https://github.com/wkfdb/MarvelOVD

What problem does this paper attempt to address?

This paper addresses the noise that arises when vision-language models (VLMs) are used to generate pseudo-labels for open-vocabulary object detection (OVD). Because VLMs are trained with contrastive language-image pre-training rather than for object detection, there is a domain gap between the two tasks. As a result, the pseudo-labels that VLMs produce are prone to noise, and the training design of the detector amplifies this noise further, which hurts detection performance. By analyzing the root causes of VLMs' prediction biases in the OVD context, the paper proposes a new framework named MarvelOVD. The framework combines the capabilities of the detector with those of the vision-language model to generate higher-quality pseudo-labels and to optimize the subsequent learning process. The specific contributions are as follows:

1. **Identifying the root causes of VLMs' prediction biases**: The authors analyze why VLMs generate noisy pseudo-labels in the OVD task and attribute the noise mainly to two factors: the VLM does not see the context of a proposal within the image, and it has no adequate notion of "background".
2. **Proposing the MarvelOVD framework**: The framework generates high-quality pseudo-labels by drawing on the detector's context awareness and its learned concept of background, thereby significantly improving OVD performance.
3. **Introducing an adaptive proposal reweighting mechanism**: To address the limited localization quality of pseudo-labels, the paper assigns an independent loss weight to each training box according to the detector's prediction and the confidence of the pseudo-label, which suppresses boxes that are poorly aligned with the target object.
4. **A stratified label assignment method**: To avoid the "base-novel conflict" problem, the paper introduces stratified label assignment so that adding novel-class pseudo-labels does not degrade detection performance on base-class objects.

With these contributions, the paper's experiments on the COCO and LVIS datasets show that MarvelOVD outperforms existing state-of-the-art methods by significant margins.
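The per-box reweighting idea in the third contribution can be illustrated with a small sketch. This is not the authors' implementation: the summary above only says that each box's weight depends on the detector's prediction and the pseudo-label confidence, so the combination rule used here (a geometric mean) and the names `box_weight` and `reweighted_loss` are assumptions chosen purely for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def box_weight(detector_prob, vlm_conf):
    """Hypothetical weighting rule (an assumption, not from the paper):
    geometric mean of the detector's probability for the pseudo-class
    and the VLM confidence attached to the pseudo-label."""
    return math.sqrt(detector_prob * vlm_conf)

def reweighted_loss(box_logits, pseudo_classes, vlm_confs):
    """Weighted cross-entropy over training boxes matched to pseudo-labels.

    box_logits     : list of per-box class logits from the detector
    pseudo_classes : pseudo-label class index assigned to each box
    vlm_confs      : VLM confidence of the pseudo-label matched to each box
    """
    weighted_sum, weight_total = 0.0, 0.0
    for logits, cls, conf in zip(box_logits, pseudo_classes, vlm_confs):
        p = softmax(logits)[cls]
        # In a real trainer the weight would be treated as a constant
        # (no gradient would flow through it).
        w = box_weight(p, conf)
        weighted_sum += w * -math.log(p)
        weight_total += w
    # Normalize by the total weight so the loss scale stays comparable.
    return weighted_sum / max(weight_total, 1e-8)
```

Under this assumed rule, a box that the detector already scores highly for its pseudo-class contributes more to the loss than a box the detector doubts, which is one simple way to down-weight poorly aligned training boxes.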

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
