Abstract: Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks even when limited annotation data are available. In this paper, we improve prompt learning by distilling the textual knowledge from natural language prompts (either human- or LLM-generated) to provide rich priors for those under-represented concepts. We first obtain a prompt ``summary'' aligned to each input image via a learned prompt aggregator. Then we jointly train a prompt generator, optimized to produce a prompt embedding that stays close to the aggregated summary while minimizing task loss at the same time. We dub such prompt embedding as Aggregate-and-Adapted Prompt Embedding (AAPE). AAPE is shown to be able to generalize to different downstream data distributions and tasks, including vision-language understanding tasks (e.g., few-shot classification, VQA) and generation tasks (image captioning) where AAPE achieves competitive performance. We also show AAPE is particularly helpful to handle non-canonical and OOD examples. Furthermore, AAPE learning eliminates LLM-based inference cost as required by baselines, and scales better with data and LLM model size.

What problem does this paper attempt to address?

The paper addresses the following problem: large pre-trained vision-language models such as CLIP can perform poorly in specialized domains (e.g., satellite imagery) or on fine-grained classification tasks (e.g., car model recognition), because the relevant visual concepts were unseen or under-represented during pre-training. A related challenge is how to fine-tune CLIP efficiently for downstream tasks when only a small amount of labeled data is available.

To address these problems, the paper proposes a prompt learning method that distills textual knowledge from natural language prompts (written by humans or generated by large language models (LLMs)) to provide rich priors for the under-represented concepts. The method has three steps:

1. **Generate natural language prompts**:
   - For object-centric image classification tasks, use large language models such as GPT-3 to generate multiple natural language prompts describing each category.
   - For more complex tasks (such as VQA), use human-generated image captions to describe multi-object images and object interactions in their backgrounds.
2. **Learn to aggregate prompt embeddings**:
   - Aggregate multiple reference prompt embeddings into a "summary" prompt embedding aligned with the input image through a learned prompt aggregator. This step aims to filter out redundant and irrelevant information so that the prompt embedding is highly relevant to the input image.
3. **Learn to generate Aggregate-and-Adapted Prompt Embedding (AAPE)**:
   - Jointly train a prompt generator so that the generated prompt embedding stays close to the aggregated prompt summary while minimizing the task loss, thereby adapting effectively to downstream tasks.
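
The aggregation step can be illustrated with a minimal sketch. The summary only states that the aggregator is learned and that its output is aligned with the input image; the version below is a non-learned stand-in that assumes a softmax-weighted average of the reference prompt embeddings, weighted by cosine similarity to the image embedding. The function name `aggregate_prompts` and the `temperature` parameter are hypothetical and do not come from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def aggregate_prompts(image_emb, prompt_embs, temperature=0.1):
    """Return (summary, weights) for one image.

    Assumption: prompts that are more similar to the image get more
    weight, so irrelevant prompts contribute little to the summary.
    A learned aggregator would replace this fixed similarity rule.
    """
    query = normalize(image_emb)
    # Cosine similarity between the image and each reference prompt.
    sims = [dot(query, normalize(p)) / temperature for p in prompt_embs]
    # Numerically stable softmax over the similarities.
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the (unnormalized) prompt embeddings.
    dim = len(prompt_embs[0])
    summary = [
        sum(w * p[i] for w, p in zip(weights, prompt_embs))
        for i in range(dim)
    ]
    return summary, weights
```

Calling `aggregate_prompts([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` puts almost all of the weight on the first prompt, since it points in the same direction as the image embedding.
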
With this method, the paper shows that AAPE performs well across different downstream data distributions and tasks, including few-shot classification, visual question answering (VQA), and image captioning. AAPE is also particularly helpful for non-canonical and OOD (out-of-distribution) examples, and it eliminates the LLM inference cost that baselines incur at test time.

In summary, the main contributions of the paper are:

- A new prompt learning method that improves the generalization of CLIP on downstream tasks by distilling textual knowledge from natural language prompts.
- AAPE performs competitively on a range of vision-language tasks, especially in few-shot and OOD settings.
- AAPE learning is data-efficient and scales better with data and LLM size.

The formulas involved in the paper are as follows:

- **Distillation Loss**: \[ L_{\text{distill}} = \|h(x) - p_a\|_2^2 \]
- **Task Loss for Image Classification**: \[ L_{\text{task}} = -\log p(y = c \mid x) \]
- **Overall Loss Function**: \[ L = \lambda L_{\text{distill}} + L_{\text{task}} \]

Here \(h(x)\) is the prompt embedding generated for input image \(x\), \(p_a\) is the aggregated prompt summary, \(c\) is the ground-truth class, and \(\lambda\) is a weight parameter with a default value of 5. These losses guide the training of the prompt generator so that the generated prompt embedding retains the textual knowledge while adapting to the downstream task.
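
As a worked illustration of how the three losses combine, here is a small sketch in plain Python. It assumes that \(p(y = c \mid x)\) is a softmax over per-class logits, which the summary does not state; the function names and the toy inputs are hypothetical, and only the formulas and the default \(\lambda = 5\) come from the text above.

```python
import math


def distill_loss(generated, summary):
    """Squared L2 distance between the generated prompt embedding
    h(x) and the aggregated prompt summary p_a."""
    return sum((g - s) ** 2 for g, s in zip(generated, summary))


def task_loss(logits, label):
    """Negative log-likelihood of the true class.

    Assumption: class probabilities are a softmax over the logits.
    Uses the log-sum-exp trick for numerical stability.
    """
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(generated, summary, logits, label, lam=5.0):
    """Overall objective: lam * L_distill + L_task."""
    return lam * distill_loss(generated, summary) + task_loss(logits, label)
```

For example, with `generated = [1.0, 2.0]`, `summary = [0.0, 0.0]`, and two equal logits, the distillation term is 5, the task term is \(\log 2\), and the total is \(5 \cdot 5 + \log 2\). A larger `lam` keeps the generated embedding closer to the aggregated summary at the expense of the task term.
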

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Learning to Prompt for Vision-Language Models

Concept-Guided Prompt Learning for Generalization in Vision-Language Models

Improving Zero-Shot Generalization for CLIP with Synthesized Prompts

Rethinking the Effect of Uninformative Class Name in Prompt Learning

Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

Learning Domain Invariant Prompt for Vision-Language Models

Unsupervised Prompt Learning for Vision-Language Models

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling

APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP

In the Era of Prompt Learning with Vision-Language Models

SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios

Revisiting Prompt Pretraining of Vision-Language Models

Cascade Prompt Learning for Vision-Language Model Adaptation

AD-CLIP: Adapting Domains in Prompt Space Using CLIP

DPL: Decoupled Prompt Learning for Vision-Language Models

Multi-modal Attribute Prompting for Vision-Language Models