Language Models as Black-Box Optimizers for Vision-Language Models

Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan

2024-05-14

Abstract: Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize the state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.

Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Multimedia

What problem does this paper attempt to address?

This paper addresses how to optimize vision-language models (VLMs) without access to model parameters, feature embeddings, or output logits. Because many VLMs rely on proprietary data and are not open-source, white-box fine-tuning is often unavailable, so the authors propose a black-box approach that optimizes VLMs through natural language prompts. A chat-based large language model (LLM), such as ChatGPT, drives an automatic hill-climbing search: the current prompts are evaluated, and the LLM is asked to refine them based on textual feedback, iterating without a human in the loop.

The main testbed is image classification with CLIP. Candidate prompts are compared by their performance, and it is the prompts, not the LLM or the VLM, that are revised in response to the feedback. Feedback that includes both positive and negative prompts makes the search more efficient, which the authors attribute to the LLM using the implicit gradient direction contained in the textual feedback. In a 1-shot setting, the method surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet, and it also outperforms human-engineered and LLM-generated prompts. The resulting text prompts are more interpretable and transfer well across different VLM architectures.

The paper also applies the framework to DALL-E 3, a state-of-the-art black-box VLM, for text-to-image generation, prompt inversion, and personalization. In short, it tackles the problem of optimizing vision-language models when no internal model information is available, and it does so by using a chat-based language model as the prompt optimizer.
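The hill-climbing search summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: `build_feedback`, `hill_climb`, `toy_score`, and `make_toy_proposer` are hypothetical names, the feedback wording is invented for illustration, and the scorer and proposer are toy stand-ins for what would in practice be a VLM evaluated on few-shot data and a chat LLM. The sketch only shows the loop structure: score the prompts, report the best and worst ones as text, ask for new candidates, and keep the best prompt seen so far.

```python
import random
from typing import Callable, Dict, List, Tuple


def build_feedback(positives: List[str], negatives: List[str]) -> str:
    """Format the best and worst prompts so far as a textual feedback message.

    The wording is illustrative only, not the prompt used in the paper.
    """
    good = "\n".join(f"- {p}" for p in positives)
    bad = "\n".join(f"- {p}" for p in negatives)
    return (
        "Here are the best performing templates:\n" + good + "\n"
        "Here are the worst performing templates:\n" + bad + "\n"
        "Write new templates that are more like the best ones."
    )


def hill_climb(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[str], List[str]],
    n_iters: int = 5,
    k: int = 2,
) -> Tuple[str, float]:
    """Black-box search: only prompt text and scalar scores are exchanged."""
    scores: Dict[str, float] = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(scores, key=scores.get, reverse=True)
        # Positive and negative examples together form the textual feedback.
        feedback = build_feedback(ranked[:k], ranked[-k:])
        for candidate in propose_fn(feedback):
            if candidate not in scores:
                scores[candidate] = score_fn(candidate)
    best = max(scores, key=scores.get)
    return best, scores[best]


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for the few-shot accuracy of a VLM with this template."""
    useful = {"photo", "clear", "centered"}
    words = set(prompt.lower().replace(".", "").split())
    return len(useful & words) / len(useful)


def make_toy_proposer(seed: int = 0) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for a chat LLM: mutates the top-ranked template."""
    rng = random.Random(seed)
    vocab = ["photo", "clear", "centered", "blurry", "sketch"]

    def propose(feedback: str) -> List[str]:
        # Line 1 of the feedback is the first "best" template, prefixed by "- ".
        best_line = feedback.splitlines()[1][2:]
        return [
            best_line.replace("{}", rng.choice(vocab) + " {}", 1)
            for _ in range(3)
        ]

    return propose


if __name__ == "__main__":
    start = ["a {}.", "a sketch of a {}.", "a blurry {}."]
    best, score = hill_climb(start, toy_score, make_toy_proposer(0))
    print(best, score)
```

Because every scored prompt is retained, the returned score can never fall below the best initial prompt. In a real setup, `score_fn` would query the black-box VLM and `propose_fn` would send the feedback message to a chat LLM and parse its reply.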

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Mutual Prompt Leaning for Vision Language Models

Learning to Prompt with Text Only Supervision for Vision-Language Models

Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

LaViP: Language-Grounded Visual Prompts

IPO: Interpretable Prompt Optimization for Vision-Language Models

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

LOBG: Less Overfitting for Better Generalization in Vision-Language Model

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

Learning to Prompt for Vision-Language Models

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Black Box Few-Shot Adaptation for Vision-Language models

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Active Prompt Learning with Vision-Language Model Priors

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning

Towards Multimodal In-Context Learning for Vision & Language Models