Abstract: This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.

What problem does this paper attempt to address?

The main problem this paper addresses is improving zero-shot performance in large-scale visual recognition, especially when the number of classes is very large (more than 20,000). Specifically, the authors propose POMP (PrOMpt Pre-training), which aims to enhance the zero-shot generalization ability of Vision-Language Models (VLMs) by pre-training a general soft prompt on a large-scale dataset (such as ImageNet-21K). The method reduces the computational and memory overhead of traditional prompt tuning, and the pre-trained prompt can be applied directly to downstream tasks such as image classification, semantic segmentation, and object detection, without fine-tuning for each task.

### Main Problems

1. **Efficient Prompt Tuning under Large-scale Classes**: Traditional prompt-tuning methods incur large computational and memory overhead when the number of classes is large, as on ImageNet-21K, where it exceeds 20,000. POMP reduces the training cost by introducing local contrast and local correction strategies, making prompt tuning at this scale feasible.
2. **Zero-shot Generalization Ability**: Existing prompt-tuning methods are usually tuned for a specific task and a limited number of classes, which limits their generalization to new classes and tasks. POMP pre-trains a general soft prompt on a large-scale dataset, enabling it to perform well on unseen datasets and tasks, especially in the zero-shot setting.

### Solutions

1. **Local Contrast**:
   - Each training step samples only a small subset of classes (for example, 1000 classes) for contrastive learning instead of using all classes, which greatly reduces the computational and memory overhead.
   - The model is therefore trained on a constantly changing subset of classes and gradually recovers the relationships between all classes.
2. **Local Correction**:
   - To mitigate the bias introduced by local contrast, POMP adds a local correction term \( m \) that adjusts the similarity scores of the negative classes.
   - The formula is:
     \[ m = -\log \left( \frac{K - 1}{N - 1} \right) \]
     where \( K \) is the number of classes sampled at each step and \( N \) is the total number of classes. Because \( K < N \), \( m \) is positive. The correction term enforces a stricter decision boundary between positive and negative classes, improving the robustness and discriminative ability of the model.
3. **Zero-shot Transfer Learning**:
   - The pre-trained POMP prompt can be used directly to generate class features for any set of classes, supporting zero-shot inference on downstream datasets and tasks.
   - With a two-stage framework, the POMP prompt can be applied to semantic segmentation and object detection: a pre-trained proposal network first generates region or mask proposals, and the class features generated by POMP are then used to classify them.

### Experimental Results

- **Image Classification**: POMP reaches an average accuracy of 67.0% on 10 downstream image classification datasets, 3.1% higher than CoOp.
- **Semantic Segmentation**: On the open-vocabulary COCO Stuff and Pascal VOC datasets, POMP reaches hIoU of 39.1% and 84.4% respectively, outperforming ZSSeg (by 6.9 on Pascal VOC).
- **Object Detection**: In the cross-dataset evaluation from LVIS to COCO and Objects365, POMP reaches AP50 of 57.9 and 22.9 respectively, exceeding Detic.

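
The local contrast and local correction steps described under Solutions can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform sampling of negatives, the placement of the margin before the temperature division, and the hypothetical total of 20,000 classes are all assumptions made for illustration. Only the formula for \( m \) and the idea of contrasting against a sampled subset of classes come from the summary.

```python
import math
import random


def local_correction(num_sampled, num_total):
    """Margin m = -log((K - 1) / (N - 1)); positive whenever K < N."""
    return -math.log((num_sampled - 1) / (num_total - 1))


def sample_classes(target, num_total, num_sampled, rng):
    """Return the ground-truth class followed by K - 1 distinct negatives.

    Uniform sampling is an assumption of this sketch.
    """
    negatives = set()
    while len(negatives) < num_sampled - 1:
        candidate = rng.randrange(num_total)
        if candidate != target:
            negatives.add(candidate)
    return [target] + sorted(negatives)


def local_contrastive_loss(similarities, margin, temperature=1.0):
    """Cross-entropy over the K sampled classes; index 0 is the positive.

    The margin is added to each negative similarity before the softmax.
    Applying it before the temperature division is an assumption.
    """
    logits = [similarities[0] / temperature]
    logits += [(s + margin) / temperature for s in similarities[1:]]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    rng = random.Random(0)
    N, K = 20_000, 1_000  # hypothetical class counts
    m = local_correction(K, N)
    classes = sample_classes(target=42, num_total=N, num_sampled=K, rng=rng)
    # Toy similarities: the positive scores 0.8, every negative scores 0.2.
    sims = [0.8] + [0.2] * (K - 1)
    print("margin:", m)
    print("sampled classes:", len(classes))
    print("loss without correction:", local_contrastive_loss(sims, 0.0))
    print("loss with correction:", local_contrastive_loss(sims, m))
```

With these hypothetical counts the margin is roughly 3.0, and the corrected loss is larger than the uncorrected one, which reflects the stricter boundary the positive class has to clear.
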
In conclusion, by pre-training a general soft prompt on a large-scale dataset, POMP addresses both efficient prompt tuning under a large number of classes and zero-shot generalization, and it improves performance across image classification, semantic segmentation, and object detection.
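
The classification stage of the two-stage transfer described above can be sketched as a nearest-class lookup by cosine similarity. This is a simplified illustration rather than the paper's pipeline: in practice the class features would come from the VLM text encoder fed with the pre-trained prompt and each class name, and the proposal features from a proposal network and image encoder. Here, hand-written toy vectors stand in for both, and `l2_normalize` and `classify_proposals` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def classify_proposals(proposal_features, class_features, class_names):
    """Label each proposal with the class of highest cosine similarity."""
    normalized_classes = [l2_normalize(c) for c in class_features]
    labels = []
    for feature in proposal_features:
        f = l2_normalize(feature)
        scores = [sum(a * b for a, b in zip(f, c)) for c in normalized_classes]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best])
    return labels


if __name__ == "__main__":
    # Toy stand-ins for text-encoder class features and proposal embeddings.
    class_names = ["cat", "dog"]
    class_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    proposals = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
    print(classify_proposals(proposals, class_features, class_names))
    # prints ['cat', 'dog']
```

Because the class features are computed once per class list, swapping in a different set of class names requires no retraining in this sketch, which mirrors the zero-shot usage the summary describes.
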

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Revisiting Prompt Pretraining of Vision-Language Models

Mutual Prompt Leaning for Vision Language Models

Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection

TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt

P$^3$OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Unleashing the Power of Visual Prompting At the Pixel Level

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Prompt learning in computer vision: a survey

Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection

Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation

Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model

Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

Learning to Prompt for Vision-Language Models

Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm

MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation

Prompt-Guided DETR with RoI-pruned masked attention for open-vocabulary object detection

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models

Visual In-Context Prompting