Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Yanxin Long, Jianhua Han, Runhui Huang, Xu Hang, Yi Zhu, Chunjing Xu, Xiaodan Liang

DOI: https://doi.org/10.1109/TNNLS.2023.3293484

2023-07-30

Abstract: Inspired by the success of vision-language models (VLMs) in zero-shot classification, recent works attempt to extend this line of work to object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since current VLMs are usually pre-trained by aligning sentence embeddings with global image embeddings, using them directly lacks the fine-grained alignment for object instances that is the core of detection. In this paper, we propose a simple but effective fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD) that introduces a fine-grained visual-text prompt adapting stage to enhance the current self-training paradigm with a more powerful fine-grained alignment. During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task. Furthermore, we propose a visual prompt module that provides prior task information (i.e., the categories that need to be predicted) to the vision branch, so that the pre-trained VLM adapts better to the downstream tasks. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the lack of fine-grained alignment in Open-Vocabulary Object Detection (OVD). Specifically:

1. **Problems with Existing Methods**:
   - Current Vision-Language Models (VLMs) are typically pre-trained by aligning global image embeddings with sentence embeddings, so they lack fine-grained alignment for individual object instances.
   - Using these models directly to generate pseudo-labels performs poorly, especially in dense prediction tasks.

2. **Proposed Method**:
   - A Fine-grained Visual-Text Prompt-driven Self-Training Paradigm for Open-Vocabulary Detection (VTP-OVD) is proposed, which enhances the current self-training paradigm by introducing a fine-grained visual-text prompt adaptation stage.
   - During the adaptation stage, learnable text prompts are used to solve an auxiliary dense pixel-wise prediction task, which gives the VLM better fine-grained alignment.
   - A visual prompt module provides prior task information (i.e., the categories to be predicted) to the visual branch, better adapting the pre-trained VLM to downstream tasks.

3. **Experimental Results**:
   - Experiments show that the method achieves state-of-the-art performance in open-vocabulary object detection, e.g., a mean Average Precision (mAP) of 31.5% on unseen classes of the COCO dataset.

With these components, the paper addresses the lack of fine-grained alignment in existing OVD methods and improves detection performance on unseen classes.
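The dense pixel-wise alignment described above can be illustrated with a small sketch. This is not the authors' implementation: it only shows the general idea of scoring every pixel embedding against class text embeddings built from shared learnable context vectors, then thresholding the scores into a pseudo-label mask. The helper names (`build_text_embeddings`, `dense_score_map`, `pseudo_label_mask`), the averaging used in place of a real text encoder, the temperature of 0.07, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def build_text_embeddings(context, class_tokens):
    """Hypothetical stand-in for a text encoder.

    Adds the mean of the shared learnable context vectors to each class-name
    embedding and normalizes the result. A real VLM would run the prompt
    tokens through its text transformer instead.
    """
    dim = len(class_tokens[0])
    pooled = [sum(c[d] for c in context) / len(context) for d in range(dim)]
    return [l2_normalize([t + p for t, p in zip(tok, pooled)])
            for tok in class_tokens]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_score_map(pixel_features, text_embeddings, temperature=0.07):
    """Per-pixel class probabilities from cosine similarity to each text embedding.

    pixel_features is an H x W grid of D-dimensional vectors; the result is an
    H x W grid of probability lists, one entry per class.
    """
    score_map = []
    for row in pixel_features:
        score_row = []
        for feat in row:
            unit = l2_normalize(feat)
            logits = [sum(a * b for a, b in zip(unit, t)) / temperature
                      for t in text_embeddings]
            score_row.append(softmax(logits))
        score_map.append(score_row)
    return score_map


def pseudo_label_mask(score_map, threshold=0.5):
    """Most likely class per pixel, or -1 where the top score is below threshold."""
    return [[max(range(len(p)), key=p.__getitem__) if max(p) >= threshold else -1
             for p in row]
            for row in score_map]


# Toy inputs: two classes, two context vectors, a 2 x 3 grid of random features.
random.seed(0)
H, W, D = 2, 3, 4
class_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
context = [[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.0, 0.1]]
pixel_features = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)]
                  for _ in range(H)]

text_embeddings = build_text_embeddings(context, class_tokens)
score_map = dense_score_map(pixel_features, text_embeddings)
mask = pseudo_label_mask(score_map)
print(mask)
```

In an actual adaptation stage the context vectors would be trained by backpropagating a pixel-wise loss through this score map; the sketch covers only the forward scoring step.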
