Abstract: We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We will empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot learning, base-to-novel class generalization, and transfer learning.
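
To make "adjusting the input of the visual encoder" concrete, here is a minimal sketch in plain Python. It assumes a padding-style prompt (a learnable frame around a downsized image) whose pixel values come from a linear projection of a text embedding. The frame layout, the linear map, and all names (`border_positions`, `language_grounded_prompt`, `apply_prompt`, `weights`) are illustrative assumptions and not the paper's actual design.

```python
import random

def border_positions(size, pad):
    """Coordinates of the frame of width `pad` on a size x size canvas."""
    return [(r, c) for r in range(size) for c in range(size)
            if r < pad or r >= size - pad or c < pad or c >= size - pad]

def language_grounded_prompt(text_emb, weights, size, pad):
    """Project a text embedding to one value per frame pixel.

    The linear projection is an assumption made for illustration.
    """
    positions = border_positions(size, pad)
    values = [sum(w * t for w, t in zip(row, text_emb)) for row in weights]
    return dict(zip(positions, values))

def apply_prompt(image, prompt, size, pad):
    """Centre the (already downsized) image and fill the frame with the prompt."""
    inner = size - 2 * pad
    assert len(image) == inner and all(len(row) == inner for row in image)
    canvas = [[0.0] * size for _ in range(size)]
    for r in range(inner):
        for c in range(inner):
            canvas[r + pad][c + pad] = image[r][c]
    for (r, c), value in prompt.items():
        canvas[r][c] = value
    return canvas

random.seed(0)
SIZE, PAD, EMB_DIM = 8, 2, 4  # toy sizes, not real encoder dimensions

# Stand-in for a text-encoder output (hypothetical values).
text_emb = [random.gauss(0, 1) for _ in range(EMB_DIM)]

# The only trainable quantity in this sketch: one weight row per frame pixel.
n_border = len(border_positions(SIZE, PAD))
weights = [[random.gauss(0, 0.1) for _ in range(EMB_DIM)] for _ in range(n_border)]

image = [[1.0] * (SIZE - 2 * PAD) for _ in range(SIZE - 2 * PAD)]
prompt = language_grounded_prompt(text_emb, weights, SIZE, PAD)
prompted = apply_prompt(image, prompt, SIZE, PAD)
```

In this sketch the encoder never appears: only `weights` would be trained, and `prompted` is what would be passed to the frozen visual encoder.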

What problem does this paper attempt to address?

The paper attempts to address the following issues:

1. **Learning visual prompts (VPs) in a single modality**: Previous work typically learns visual prompts independently of the semantic information of the categories, which is inconsistent with the human multimodal perception system. Can using language information to design visual prompts improve the model's adaptability and generalization? If so, what design questions arise?
2. **Training efficiency**: Learning visual prompts requires a large number of iterations to reach high-quality results, especially when adapting to high-dimensional visual inputs and asymmetric vision-language encoders. Can language help overcome this limitation?
3. **Generalization beyond seen categories**: Model reprogramming is essentially a transfer learning method and lacks an explicit mechanism for generalizing to unseen categories, whereas vision-language models have strong zero-shot capabilities. Can language be used to design a model reprogramming algorithm that generalizes to unseen categories during adaptation?
4. **Adaptation without access to model parameters**: In some cases, due to ethical constraints or other reasons, the structure and weights of the base model are not accessible. How can the model then be adapted through an API or other means while preserving its generalization ability?

To address these challenges, the paper proposes **Language-Grounded Visual Prompts (LaViP)**, a method that generates visual prompts from language information to adjust the input of the visual encoder without modifying or adding model parameters. The main contributions of LaViP include:

- Proposing, for the first time, a language-grounded model reprogramming solution to adapt visual encoders for downstream tasks.
- Proposing a mechanism that allows visual prompts to extend to unseen categories without retraining.
- Evaluating the algorithm extensively under three learning paradigms: few-shot learning, generalization to unseen categories, and transfer learning. The results show that LaViP significantly outperforms existing methods on multiple datasets.

Through these contributions, LaViP improves the adaptability and generalization ability of the model and remains applicable in black-box scenarios.
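
As an illustration of adaptation without access to model parameters, the sketch below tunes prompt parameters using only scalar loss values returned by a black box. It uses SPSA (simultaneous perturbation stochastic approximation), a standard gradient-free estimator; the summary above does not state which optimizer the paper uses, so this choice, the toy `black_box_loss`, and the step sizes are all assumptions.

```python
import random

def spsa_gradient(loss_fn, params, c, rng):
    """Two-query SPSA estimate of the gradient of loss_fn at params."""
    # Random +/-1 perturbation applied to every coordinate at once.
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + c * d for p, d in zip(params, delta)]
    minus = [p - c * d for p, d in zip(params, delta)]
    diff = loss_fn(plus) - loss_fn(minus)
    return [diff / (2.0 * c * d) for d in delta]

def black_box_loss(prompt_params):
    """Stand-in for an API call: returns a scalar loss and no gradients.

    A real setup would send a prompted image to the remote model and
    score its prediction; this quadratic is purely hypothetical.
    """
    target = [0.5, -1.0, 2.0]
    return sum((p - t) ** 2 for p, t in zip(prompt_params, target))

rng = random.Random(0)
params = [0.0, 0.0, 0.0]  # toy prompt parameters
initial_loss = black_box_loss(params)

for _ in range(500):
    grad = spsa_gradient(black_box_loss, params, c=0.01, rng=rng)
    params = [p - 0.05 * g for p, g in zip(params, grad)]

final_loss = black_box_loss(params)
```

Each update costs two loss queries regardless of how many prompt parameters there are, which is why estimators of this kind are a common fit for API-only access.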

LaViP: Language-Grounded Visual Prompts

VPA: Fully Test-Time Visual Prompt Adaptation

Mutual Prompt Leaning for Vision Language Models

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition

Towards Robust and Accurate Visual Prompting

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

Visual In-Context Prompting

Generalizable Prompt Tuning for Vision-Language Models

SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt

LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models

Language Models as Black-Box Optimizers for Vision-Language Models

Unsupervised Prompt Learning for Vision-Language Models

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Exploring Visual Prompts for Adapting Large-Scale Models

Revisiting Prompt Pretraining of Vision-Language Models

Explicit Visual Prompting for Universal Foreground Segmentations

Attention Prompting on Image for Large Vision-Language Models

Learning to Prompt with Text Only Supervision for Vision-Language Models

IPO: Interpretable Prompt Optimization for Vision-Language Models