PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Thi Minh Anh Pham, An Duc Nguyen, Cephas Svosve, Vasileios Argyriou, Georgios Tzimiropoulos

2024-09-15

Abstract: Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, attaining optimal performance requires manually selecting prompts to improve the alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is a major obstacle to deploying such models in practice, since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, the recent work Context Optimization (CoOp) introduced prompt learning to the vision domain using learnable textual tokens. While CoOp achieves substantial improvements over manual prompts, its learned context generalizes poorly to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE), a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, in the 16-shot setting, PRE improves average accuracy on New classes by 5.60% and the Harmonic mean by 3% compared to CoOp, while keeping training time reasonable.
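The Harmonic mean mentioned in the abstract is assumed here to be the usual base-to-new generalization score, that is, the harmonic mean of Base-class and New-class accuracy. The abstract does not spell out the formula. A minimal sketch under that assumption, with hypothetical accuracy values that are not results from the paper:

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies given in percent.

    Assumption: this is the metric the abstract calls "Harmonic mean";
    the abstract itself does not define it.
    """
    if base_acc + new_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Hypothetical accuracies for illustration only.
score = harmonic_mean(80.0, 60.0)
print(round(score, 2))  # 68.57
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method that gains on Base classes while losing on New classes is penalized more than it would be under a plain average.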

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses how to adapt vision-language foundation models to domain-specific data in downstream tasks, especially when only a few samples are available, so that the model generalizes better to unseen classes. Although large-scale vision-language foundation models such as CLIP show great potential in zero-shot transfer learning, manual prompt engineering is the main obstacle to deploying them in practice, because it requires domain-specific knowledge and a great deal of time. Existing soft-prompt optimization methods such as CoOp improve on manual prompts, but they generalize poorly to a wider range of unseen classes within the same dataset. The paper therefore proposes Prompt Learning with Reparameterization Encoder (PRE), which aims to improve the generalization of learned prompts by reparameterizing the input prompt embeddings with a prompt encoder rather than optimizing them directly.
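The reparameterization idea can be illustrated with a small forward-pass sketch. The text above does not specify the encoder architecture, so the two-layer network with a residual connection below is a stand-in chosen for illustration, not the authors' design. The names `PromptEncoder`, `reparameterize`, `make_matrix`, and `matvec` are hypothetical, and the sizes are arbitrary.

```python
import random


def make_matrix(rows, cols, rng, scale=0.02):
    """Random matrix as a list of rows (stands in for learnable weights)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class PromptEncoder:
    """Hypothetical encoder: two linear layers, ReLU, and a residual connection.

    Assumption: the source only says that a prompt encoder reparameterizes
    the prompt embeddings; this architecture is an illustrative placeholder.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(hidden, dim, rng)
        self.w_out = make_matrix(dim, hidden, rng)

    def __call__(self, token):
        hidden = [max(0.0, h) for h in matvec(self.w_in, token)]  # ReLU
        delta = matvec(self.w_out, hidden)
        return [t + d for t, d in zip(token, delta)]  # residual connection


def reparameterize(context, encoder):
    """Pass each learnable context token through the encoder."""
    return [encoder(token) for token in context]


rng = random.Random(1)
context = make_matrix(4, 8, rng)           # 4 context tokens of dimension 8
encoder = PromptEncoder(dim=8, hidden=16)
prompts = reparameterize(context, encoder)  # these would feed the text encoder
print(len(prompts), len(prompts[0]))
```

In a CoOp-style setup the context tokens are optimized directly. In this sketch the tokens that would reach the text encoder are `prompts`, the encoder's output, so during training the context tokens and the encoder weights would be updated together on the few-shot samples. The training loop and the CLIP text encoder are omitted here.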

Learning to Prompt for Vision-Language Models

Revisiting Prompt Pretraining of Vision-Language Models

Learning Domain Invariant Prompt for Vision-Language Models

Unsupervised Prompt Learning for Vision-Language Models

MaPLe: Multi-modal Prompt Learning

CoPL: Contextual Prompt Learning for Vision-Language Understanding

Consistency-guided Prompt Learning for Vision-Language Models

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Concept-Guided Prompt Learning for Generalization in Vision-Language Models

Read-only Prompt Optimization for Vision-Language Few-shot Learning

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning

Mutual Prompt Leaning for Vision Language Models

Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling

Progressive Visual Prompt Learning with Contrastive Feature Re-formation

Learning Expressive Prompting With Residuals for Vision Transformers

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

SEP: Self-Enhanced Prompt Tuning for Visual-Language Model