Abstract: Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time on word tuning, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
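
The idea of modeling context words with learnable vectors can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `toy_text_encoder` is a hypothetical mean-pooling stand-in for the frozen text encoder, the embedding sizes and the temperature value are arbitrary assumptions, and the training step (updating only the context vectors by gradient descent) is omitted. It shows only how a prompt is assembled from context vectors plus a class-name embedding, and how unified and class-specific contexts differ in shape.

```python
import math
import random


def make_context(n_ctx, dim, n_classes=None, seed=0):
    """Create context vectors (the only parameters CoOp would train).

    Unified context: one block of n_ctx vectors shared by all classes.
    Class-specific context: one independent block per class.
    """
    rng = random.Random(seed)

    def block():
        return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_ctx)]

    if n_classes is None:
        return block()
    return [block() for _ in range(n_classes)]


def build_prompt(context, class_embedding):
    """Prompt = [V]_1 ... [V]_M [CLASS], as a list of token vectors."""
    return context + [class_embedding]


def toy_text_encoder(prompt):
    """Hypothetical stand-in for the frozen text encoder: mean-pool the tokens."""
    dim = len(prompt[0])
    return [sum(tok[d] for tok in prompt) / len(prompt) for d in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_probabilities(image_feature, text_features, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (assumed temperature)."""
    logits = [cosine(image_feature, t) / temperature for t in text_features]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: 3 classes, 4 context vectors, 8-dimensional embeddings.
dim, n_ctx, n_classes = 8, 4, 3
rng = random.Random(1)
class_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_classes)]
image_feature = [rng.gauss(0, 1) for _ in range(dim)]

unified = make_context(n_ctx, dim)                          # shared by all classes
specific = make_context(n_ctx, dim, n_classes=n_classes)    # one block per class

unified_feats = [toy_text_encoder(build_prompt(unified, c)) for c in class_embeddings]
specific_feats = [
    toy_text_encoder(build_prompt(ctx, c)) for ctx, c in zip(specific, class_embeddings)
]
probs = class_probabilities(image_feature, unified_feats)
```

In a real setting the context vectors would be the only tensors receiving gradients, with the image and text encoders frozen; the class-specific variant multiplies the number of trainable vectors by the number of classes.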

Rethinking the Effect of Uninformative Class Name in Prompt Learning

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

DPL: Decoupled Prompt Learning for Vision-Language Models

Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

Hierarchy-Aware Interactive Prompt Learning for Few-Shot Classification

In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Unsupervised Prompt Learning for Vision-Language Models

Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification

Concept-Guided Prompt Learning for Generalization in Vision-Language Models

AAPL: Adding Attributes to Prompt Learning for Vision-Language Models

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Tree of Attributes Prompt Learning for Vision-Language Models

Revisiting Prompt Pretraining of Vision-Language Models

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Deeply Coupled Cross-Modal Prompt Learning

SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Learning to Prompt for Vision-Language Models

Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning