Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Yanxin Long, Jianhua Han, Runhui Huang, Xu Hang, Yi Zhu, Chunjing Xu, Xiaodan Liang
DOI: https://doi.org/10.1109/TNNLS.2023.3293484
2023-07-30
Abstract: Inspired by the success of vision-language models (VLMs) in zero-shot classification, recent works attempt to extend this line of work to object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since current VLMs are usually pre-trained by aligning sentence embeddings with global image embeddings, using them directly lacks the fine-grained alignment for object instances that is the core of detection. In this paper, we propose a simple but effective fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD), which introduces a fine-grained visual-text prompt adapting stage to enhance the current self-training paradigm with more powerful fine-grained alignment. During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task. Furthermore, we propose a visual prompt module that provides prior task information (i.e., the categories that need to be predicted) to the vision branch, to better adapt the pre-trained VLM to the downstream tasks. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the lack of fine-grained alignment in Open-Vocabulary Object Detection (OVD). Specifically:

1. **Problems with Existing Methods**:
   - Current Vision-Language Models (VLMs) are typically pre-trained by aligning sentence embeddings with global image embeddings, so they lack fine-grained alignment for individual object instances.
   - Existing self-training approaches use these models directly to generate pseudo-labels for unseen classes, and so inherit this missing instance-level alignment, which is the core of detection.
2. **Proposed Method**:
   - A Fine-grained Visual-Text Prompt-driven Self-Training Paradigm for Open-Vocabulary Detection (VTP-OVD) is proposed, which enhances the current self-training paradigm by introducing a fine-grained visual-text prompt adaptation stage.
   - During the adaptation stage, learnable text prompts are used to solve an auxiliary dense pixel-wise prediction task, which gives the VLM better fine-grained alignment.
   - A visual prompt module provides prior task information (i.e., the categories to be predicted) to the vision branch, better adapting the pre-trained VLM to downstream tasks.
3. **Experimental Results**:
   - Experiments show that the method achieves state-of-the-art performance in open-vocabulary object detection, e.g., a mean Average Precision (mAP) of 31.5% on unseen categories of the COCO dataset.

In short, the adaptation stage gives the self-training pipeline the fine-grained alignment that existing OVD methods lack.
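
To make the idea of pixel-level visual-text alignment concrete, the following is a minimal sketch and not the authors' implementation. It scores each pixel feature against a set of class text embeddings by cosine similarity, turns the scores into per-pixel class probabilities, and keeps only confident pixels as pseudo-labels. The function names, the `temperature` and `threshold` values, and the toy feature vectors are all assumptions made for illustration. In the paper the text embeddings would come from learnable prompts passed through the VLM text encoder, and the pseudo-labels are used for detection rather than as a raw label map.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pixel_text_score_map(pixel_feats, text_embeds, temperature=0.07):
    """Per-pixel class probabilities from cosine similarity.

    pixel_feats: H x W x D nested lists (hypothetical dense image features).
    text_embeds: K x D nested lists (hypothetical class text embeddings).
    temperature: assumed softmax temperature, not taken from the paper.
    Returns an H x W x K nested list of softmax probabilities.
    """
    text_n = [l2_normalize(t) for t in text_embeds]
    score_map = []
    for row in pixel_feats:
        out_row = []
        for feat in row:
            f = l2_normalize(feat)
            logits = [sum(a * b for a, b in zip(f, t)) / temperature for t in text_n]
            peak = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            out_row.append([e / total for e in exps])
        score_map.append(out_row)
    return score_map


def pseudo_label_map(score_map, threshold=0.8):
    """Keep the argmax class where the model is confident, else -1 (ignored)."""
    labels = []
    for row in score_map:
        labels.append([
            max(range(len(p)), key=p.__getitem__) if max(p) >= threshold else -1
            for p in row
        ])
    return labels


# Toy example: a 1 x 2 "image" with 2-D features and two classes.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [[[1.0, 0.0], [1.0, 1.0]]]
scores = pixel_text_score_map(pixel_feats, text_embeds)
labels = pseudo_label_map(scores)
print(labels)
```

In the toy example the first pixel aligns with class 0 and is kept, while the second pixel is equally similar to both classes and is dropped as ambiguous. Filtering by confidence in this way is a common self-training heuristic; the source does not state which filtering rule VTP-OVD uses.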