Abstract: Prompt learning is an effective method to customize Vision-Language Models (VLMs) for various downstream tasks by tuning only a small number of parameters in the input prompt tokens. Recently, prompt pretraining on large-scale datasets (e.g., ImageNet-21K) has played a crucial role in prompt learning for universal visual discrimination. However, we revisit this practice and observe that the limited number of learnable prompts risks underfitting the extensive image data seen during prompt pretraining, which in turn leads to poor generalization. To address these issues, we propose a general framework termed Revisiting Prompt Pretraining (RPP), which aims to improve fitting and generalization ability from two aspects: prompt structure and prompt supervision. For prompt structure, we break the restriction in common practice where the query, key, and value vectors are all derived from a single shared learnable prompt token. Instead, we introduce unshared, individual learnable prompts for the query, key, and value, thereby enhancing the model's fitting capacity through increased parameter diversity. For prompt supervision, we additionally use soft labels derived from the zero-shot probability predictions of a pretrained Contrastive Language-Image Pretraining (CLIP) teacher model. These soft labels carry more nuanced and general information about inter-class relationships, thereby endowing the pretraining process with better generalization ability. RPP produces a more resilient prompt initialization, improving its transferability across diverse visual recognition tasks. Experiments across various benchmarks consistently confirm the state-of-the-art (SOTA) performance of our pretrained prompts. Code and models will be made available soon.
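
The two ideas in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. All function names, the projection matrices `w_q`, `w_k`, `w_v`, the mixing weight `alpha`, the `temperature`, and the choice of a KL-divergence term for the soft labels are assumptions made for illustration; the abstract only states that the query, key, and value prompts are unshared and that soft labels from a zero-shot CLIP teacher are used as additional supervision.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def shared_prompt_qkv(prompt, w_q, w_k, w_v):
    """Common practice: one learnable prompt token feeds all three projections."""
    return matvec(w_q, prompt), matvec(w_k, prompt), matvec(w_v, prompt)


def unshared_prompt_qkv(prompt_q, prompt_k, prompt_v, w_q, w_k, w_v):
    """Unshared variant (assumed form): each projection gets its own prompt."""
    return matvec(w_q, prompt_q), matvec(w_k, prompt_k), matvec(w_v, prompt_v)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Hard-label loss: negative log-probability of the ground-truth class."""
    return -math.log(probs[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); terms with zero teacher mass contribute nothing."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def combined_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=1.0):
    """Hypothetical mix of hard-label and soft-label supervision.

    teacher_logits stands in for the zero-shot scores of a frozen CLIP teacher;
    alpha and temperature are illustrative hyperparameters, not from the paper.
    """
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return (1 - alpha) * hard + alpha * soft


if __name__ == "__main__":
    # Toy 3-class example: the student is pulled toward both the label and the teacher.
    print(combined_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], label=0))
```

In this sketch the unshared variant triples the number of prompt parameters per token, which is one way to read the "increased parameter diversity" mentioned above, while the soft-label term lets the teacher's probability mass on non-target classes inform the student about inter-class similarity.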

A prompt tuning method for few-shot action recognition.

Prompt Tuning with Soft Context Sharing for Vision-Language Models

Knowledge Prompting for Few-shot Action Recognition

Multi-Task Pre-Training of Modular Prompt for Few-Shot Learning

Multitask Pre-training of Modular Prompt for Chinese Few-Shot Learning

Ontology-enhanced Prompt-tuning for Few-shot Learning

VPPT: Visual Pre-Trained Prompt Tuning Framework for Few-Shot Image Classification

Unified Vision and Language Prompt Learning

Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization

Unified Prompt Learning Makes Pre-Trained Language Models Better Few-Shot Learners

Towards Unified Prompt Tuning for Few-shot Text Classification

Meta-Prompt Tuning Vision-Language Model for Multi-Label Few-Shot Image Recognition

MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Revisiting Prompt Pretraining of Vision-Language Models

Hierarchical Prompt Tuning for Few-Shot Multi-Task Learning

PPT: Pre-trained Prompt Tuning for Few-shot Learning

Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Knowledge-Enhanced Prompt Learning for Few-Shot Text Classification

Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval