Chain of Thought Prompt Tuning in Vision Language Models

Jiaxin Ge, Hongyin Luo, Siyuan Qian, Yulu Gan, Jie Fu, Shanghang Zhang
DOI: https://doi.org/10.48550/arXiv.2304.07919
2023-06-17
Abstract: Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies use only a single prompt for tuning, neglecting the inherent step-by-step cognitive reasoning process that humans conduct in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation of the human reasoning process and has proven useful for natural language processing (NLP) tasks. Based on this cognitive intuition, we believe that conducting effective reasoning is also an important problem in visual tasks, and that a chain of thought could be a solution to this problem. In this work, we propose a novel chain of thought prompt tuning for vision-language modeling. Extensive experiments show that our method not only generalizes better in image classification tasks, has greater transferability beyond a single dataset, and has stronger domain generalization performance, but also performs much better in image-text retrieval and visual question answering, which require more reasoning capabilities. We are the first to successfully adapt chain-of-thought prompting that combines visual and textual embeddings. We will release our code.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the lack of effective reasoning capabilities in vision-language models when handling complex tasks. Specifically:

1. **Limitations of Existing Methods**:
   - Most current research only uses a single prompt for tuning, ignoring the step-by-step cognitive reasoning process humans use when dealing with complex tasks.
   - This single-prompt approach performs poorly when dealing with unfamiliar image domains, leading to insufficient generalization and cross-dataset transfer capabilities of the models.
2. **Introduction of Chain of Thought**:
   - Chain of Thought is a simple yet effective method that approximates the human reasoning process, which has been proven very useful in natural language processing (NLP) tasks.
   - The authors believe that effective reasoning is equally important in visual tasks, and Chain of Thought can serve as a solution to enhance the reasoning capabilities of the models.
3. **Proposed New Method**:
   - The authors propose a novel Chain of Thought prompt tuning method for vision-language modeling.
   - By designing multiple connected prompts, where each prompt receives information from the previous one and passes it to the next, the method simulates the step-by-step human reasoning process.
   - A dynamic chain controller is introduced to dynamically control the weights of the chain based on the input, adapting to different images and task requirements.
   - A set of Meta-Nets is designed, with each Meta-Net generating biases for specific steps to enhance the model's reasoning capabilities.
4. **Experimental Validation**:
   - The authors conducted extensive experiments on various tasks, including image classification, image-text retrieval, and visual question answering.
   - Experimental results show that this method not only has better generalization capabilities in image classification tasks but also performs excellently in tasks requiring more reasoning capabilities, such as image-text retrieval and visual question answering.

### Summary

By introducing the Chain of Thought prompt tuning method, this paper addresses the lack of effective reasoning capabilities in existing vision-language models when handling complex tasks. This method not only improves the model's generalization and cross-dataset transfer capabilities but also achieves significant performance improvements across various tasks.
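
To make the proposed method more concrete, the following is a minimal sketch of how chained prompts, a dynamic chain controller, and per-step Meta-Nets might fit together. It is not the authors' implementation. The function names, the use of plain linear maps for the Meta-Nets and the controller, the softmax over step weights, and the additive way the prompt is carried from step to step are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def chain_of_prompts(image_feat, step_prompts, meta_nets, controller):
    """Hypothetical chain: returns (final prompt, per-step weights).

    image_feat   : image feature vector
    step_prompts : one learnable prompt vector per reasoning step
    meta_nets    : one matrix per step (assumed linear Meta-Net)
    controller   : matrix with one row per step (assumed linear controller)
    """
    # Dynamic chain controller: input-dependent weight for each step.
    weights = softmax(linear(controller, image_feat))

    # Prompt state handed from one step to the next.
    carried = [0.0] * len(step_prompts[0])
    for prompt, meta_net, weight in zip(step_prompts, meta_nets, weights):
        # Step-specific Meta-Net: image-conditioned bias for this step.
        bias = linear(meta_net, image_feat)
        step = [p + b for p, b in zip(prompt, bias)]
        # Receive the previous state, add this step's weighted contribution.
        carried = [c + weight * s for c, s in zip(carried, step)]
    return carried, weights
```

Calling `chain_of_prompts` with an image feature, a list of step prompts, and matching Meta-Net and controller matrices returns the final prompt vector together with the per-step weights. In a real system the final prompt would be fed to a text encoder and all parameters would be learned; both are omitted here.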