Abstract: Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are devised to detect new classes. The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model. It is then used as the region classifier to supervise the training of a detector. The key element that leads to the success of this model is a proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, several prompt representation learning methods have been proposed for the image classification task; however, they are only sub-optimal solutions when applied to the detection task. In this paper, we introduce a novel method, detection prompt (DetPro), to learn continuous prompt representations for open-vocabulary object detection based on the pre-trained vision-language model. Different from the previous classification-oriented methods, DetPro has two highlights: 1) a background interpretation scheme that includes the proposals in the image background in prompt training; 2) a context grading scheme that separates proposals in the image foreground for tailored prompt training. We assemble DetPro with ViLD, a recent state-of-the-art open-world object detector, and conduct experiments on LVIS as well as transfer learning experiments on the Pascal VOC, COCO, and Objects365 datasets. Experimental results show that DetPro outperforms the baseline ViLD in all settings, e.g., +3.4 APbox and +3.0 APmask improvements on the novel classes of LVIS. Code and models are available at https://github.com/dyabel/detpro.
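To make the idea of using class text embeddings as a region classifier concrete, the following is a minimal sketch and not the DetPro or ViLD implementation. It assumes the embeddings have already been computed; the function names, the toy 3-dimensional vectors, and the temperature value are all hypothetical and chosen only for illustration. A region embedding is scored against each class text embedding by cosine similarity, and the scores are turned into class probabilities with a softmax.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_region(region_embedding, class_text_embeddings, temperature=0.01):
    """Return class probabilities for one region proposal.

    Assumption: `temperature` is an illustrative value, not one taken
    from the paper.
    """
    region = l2_normalize(region_embedding)
    logits = []
    for text_embedding in class_text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(r * t for r, t in zip(region, text))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the class logits.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (hypothetical): two class text embeddings and one region embedding.
class_text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embedding = [0.9, 0.1, 0.0]

probs = classify_region(region_embedding, class_text_embeddings)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the quality of the classifier depends entirely on the text embeddings, which is why the choice of prompt fed to the text encoder matters so much in the setting the abstract describes.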

Multimodal Inplace Prompt Tuning for Open-set Object Detection

Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization

R-Tuning: Regularized Prompt Tuning in Open-Set Scenarios

Tuning Multi-mode Token-level Prompt Alignment across Modalities

Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model

DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection

Unsupervised Prompt Tuning for Text-Driven Object Detection

Modality-invariant and Specific Prompting for Multimodal Human Perception Understanding

Visual Modality Prompt for Adapting Vision-Language Object Detectors

DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Bridging the Gap: Neural Collapse Inspired Prompt Tuning for Generalization under Class Imbalance

M^2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Mobile User Interface Element Detection Via Adaptively Prompt Tuning

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Contextual Object Detection with Multimodal Large Language Models

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Unified-modal Salient Object Detection via Adaptive Prompt Learning

Open-World Human-Object Interaction Detection via Multi-modal Prompts