Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun
2023-10-07
Abstract: This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper aims to improve zero-shot performance in large-scale visual recognition, particularly when the number of classes exceeds 20,000. The authors propose POMP (PrOMpt Pre-training), which strengthens the zero-shot generalization of vision-language models (VLMs) by pre-training a general soft prompt on a large-scale dataset such as ImageNet-21K. The method reduces the computational and memory overhead of conventional prompt tuning, and the pre-trained prompt can be applied directly to downstream tasks such as image classification, semantic segmentation, and object detection without task-specific fine-tuning.

### Main Problems

1. **Efficient Prompt Tuning under Large-scale Classes**: Conventional prompt-tuning methods incur heavy computational and memory overhead when the class set is large, as on ImageNet-21K, where the number of classes exceeds 20,000. POMP reduces the training cost through its local contrast and local correction strategies, which makes prompt tuning over this many classes feasible.
2. **Zero-shot Generalization Ability**: Existing prompt-tuning methods are usually tuned for a specific task and a limited number of classes, so they generalize poorly to new classes and tasks. POMP pre-trains a general soft prompt on a large-scale dataset so that it performs well on unseen datasets and tasks, especially in the zero-shot setting.

### Solutions

1. **Local Contrast**:
   - Each training step samples only a small subset of classes (for example, 1,000) for contrastive learning instead of using all classes, which greatly reduces the computational and memory overhead.
   - Because the sampled subset changes from step to step, the model gradually recovers the relationships among all classes.
2. **Local Correction**:
   - To mitigate the bias introduced by local contrast, POMP adds a local correction term \( m \) that adjusts the similarity scores of the negative classes.
   - The specific formula is:
     \[ m = -\log \left( \frac{K - 1}{N - 1} \right) \]
     where \( K \) is the number of classes sampled at each step and \( N \) is the total number of classes. Because \( K < N \), \( m \) is positive. This correction term enforces a stricter decision boundary between positive and negative classes, which improves the robustness and discriminative ability of the model.
3. **Zero-shot Transfer Learning**:
   - The pre-trained POMP prompt can be used directly to generate class features for any set of classes, which supports zero-shot inference on downstream datasets and tasks.
   - For semantic segmentation and object detection, POMP is applied within a two-stage framework: a pre-trained proposal network first generates region or mask proposals, and the class features generated by POMP are then used to classify them.

### Experimental Results

- **Image Classification**: POMP reaches an average accuracy of 67.0% on 10 downstream image classification datasets, 3.1 points higher than CoOp.
- **Semantic Segmentation**: On the open-vocabulary COCO Stuff and Pascal VOC datasets, POMP reaches hIoU of 39.1 and 84.4 respectively, outperforming ZSSeg (+6.9 on Pascal VOC).
- **Object Detection**: In the cross-dataset evaluation from LVIS to COCO and Objects365, POMP reaches AP50 of 57.9 and 22.9 respectively, exceeding Detic.
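The local contrast and local correction steps described under Solutions can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the stand-in similarity values, the temperature of 0.07, and the round figure of 21,000 total classes are all assumptions made for illustration. In particular, adding \( m \) to every negative logit inside a softmax cross-entropy is one plausible reading of "adjusting the similarity scores of negative classes", not a confirmed detail.

```python
import math
import random


def local_correction(num_sampled: int, num_total: int) -> float:
    """Correction term m = -log((K - 1) / (N - 1)) from the summary."""
    return -math.log((num_sampled - 1) / (num_total - 1))


def sample_classes(label: int, num_total: int, num_sampled: int,
                   rng: random.Random) -> list[int]:
    """Return the ground-truth class followed by K - 1 distinct negatives."""
    # Sample from N - 1 slots, then shift indices at or above the label by
    # one so that the ground-truth class can never be drawn as a negative.
    drawn = rng.sample(range(num_total - 1), num_sampled - 1)
    negatives = [c if c < label else c + 1 for c in drawn]
    return [label] + negatives


def local_contrastive_loss(similarities: list[float], temperature: float,
                           m: float) -> float:
    """Cross-entropy over the sampled classes; index 0 is the positive.

    Assumption: m is added to each negative logit after temperature scaling.
    """
    logits = [similarities[0] / temperature]
    logits += [s / temperature + m for s in similarities[1:]]
    peak = max(logits)  # subtract the maximum for a stable log-sum-exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


# Illustrative values only: N and K are round numbers, and the similarities
# are random stand-ins for image-text cosine similarities.
rng = random.Random(0)
N, K = 21_000, 1_000
m = local_correction(K, N)
classes = sample_classes(label=42, num_total=N, num_sampled=K, rng=rng)
sims = [0.3] + [rng.uniform(-0.1, 0.1) for _ in range(K - 1)]
loss_plain = local_contrastive_loss(sims, temperature=0.07, m=0.0)
loss_corrected = local_contrastive_loss(sims, temperature=0.07, m=m)
print(f"m = {m:.3f}, sampled classes = {len(classes)}")
print(f"loss without correction = {loss_plain:.3f}")
print(f"loss with correction    = {loss_corrected:.3f}")
```

With these numbers \( m \) is roughly 3.05, and the corrected loss is always larger than the uncorrected one because every negative logit is raised by the same positive amount. When \( K = N \) the term vanishes, which matches the intuition that no correction is needed when all classes are contrasted.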
In conclusion, by pre-training a general soft prompt on a large-scale dataset, POMP addresses both the cost of prompt tuning over a large class set and the limited zero-shot generalization of earlier prompt-tuning methods, and it improves performance across image classification, semantic segmentation, and object detection.