LaViP: Language-Grounded Visual Prompts

Nilakshan Kunananthaseelan, Jing Zhang, Mehrtash Harandi
2023-12-18
Abstract: We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We will empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot learning, base-to-novel class generalization, and transfer learning.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses four questions:

1. **Learning visual prompts (VPs) in a single modality**: Previous work usually learns visual prompts independently of class semantics, unlike human perception, which is multimodal. If language information is used to design visual prompts, does it improve the model's adaptability and generalization? If so, what design issues arise?
2. **Efficient training**: Learning visual prompts takes many iterations to reach high-quality results, especially when adapting to high-dimensional visual inputs and asymmetric vision-language encoders. Can language help overcome this limitation?
3. **Generalization beyond seen categories**: Model reprogramming is essentially a transfer learning method and has no explicit mechanism for generalizing to unseen categories, whereas vision-language models have strong zero-shot capabilities. Can language be used to design a reprogramming algorithm that generalizes to unseen categories during adaptation?
4. **Adaptation without access to model parameters**: In some cases, for ethical or other reasons, the structure and weights of the base model are not accessible. How can the model then be adapted through an API or similar interface while keeping its generalization ability?

To address these challenges, the paper proposes **Language-Grounded Visual Prompts (LaViP)**, a method that generates visual prompts from language information and adjusts the input of the visual encoder without modifying or adding model parameters. The main contributions of LaViP are:

- Proposing, for the first time, a language-grounded model reprogramming solution to adapt visual encoders for downstream tasks.
- Proposing a mechanism that lets visual prompts extend to unseen categories without retraining.
- Evaluating the algorithm under three learning paradigms: few-shot learning, base-to-novel class generalization, and transfer learning. The results show that LaViP significantly outperforms existing methods on multiple datasets.

Together, these contributions improve the adaptability and generalization of the model, and the method remains applicable in black-box scenarios.
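
The central idea, deriving the visual prompt from language and changing only the encoder's input, can be illustrated with a small sketch. This is not the authors' implementation: the border-style prompt layout, the linear projection `proj`, the single-channel image, and the function names are all assumptions made for illustration, and the summary above does not specify any of them.

```python
def border_positions(size, pad):
    """Row-major (row, col) coordinates of a border of width `pad`."""
    return [(r, c) for r in range(size) for c in range(size)
            if r < pad or r >= size - pad or c < pad or c >= size - pad]


def language_grounded_prompt(text_emb, proj, size, pad):
    """Map a class-text embedding to a size x size prompt.

    Assumption: each border pixel is a linear function of the text
    embedding (one row of `proj` per border pixel); the centre is zero.
    """
    positions = border_positions(size, pad)
    if len(proj) != len(positions):
        raise ValueError("proj needs one row per border pixel")
    prompt = [[0.0] * size for _ in range(size)]
    for (r, c), weights in zip(positions, proj):
        prompt[r][c] = sum(w * e for w, e in zip(weights, text_emb))
    return prompt


def apply_prompt(image, prompt, pad):
    """Place `image` in the centre of the prompt canvas.

    Only the input changes; a frozen encoder would consume the result
    without any of its own parameters being touched.
    """
    size = len(prompt)
    if len(image) + 2 * pad != size:
        raise ValueError("image must be (size - 2*pad) pixels square")
    out = [row[:] for row in prompt]
    for r, img_row in enumerate(image):
        for c, value in enumerate(img_row):
            out[r + pad][c + pad] = value
    return out
```

In a scheme like this, `proj` would be the only trainable quantity. Because the same `proj` can be applied to the text embedding of any class name, a prompt for an unseen class can be produced without retraining, which mirrors the base-to-novel property described above. Since nothing inside the encoder is read or modified, the same input-side adjustment could in principle be tuned against a model exposed only through an API.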