Abstract: Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies use only a single prompt for tuning, neglecting the inherent step-by-step cognitive reasoning process that humans conduct in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation of the human reasoning process and has proven useful for natural language processing (NLP) tasks. Based on this cognitive intuition, we believe that conducting effective reasoning is also an important problem in visual tasks, and that a chain of thought could be a solution to this problem. In this work, we propose a novel chain of thought prompt tuning method for vision-language modeling. Extensive experiments show that our method not only generalizes better in image classification tasks, has greater transferability beyond a single dataset, and has stronger domain generalization performance, but also performs much better in image-text retrieval and visual question answering, which require more reasoning capabilities. We are the first to successfully adapt chain-of-thought prompting that combines visual and textual embeddings. We will release our code.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the lack of effective reasoning capabilities in vision-language models when handling complex tasks. Specifically:

1. **Limitations of Existing Methods**:
   - Most current research uses only a single prompt for tuning, ignoring the step-by-step cognitive reasoning process humans use when dealing with complex tasks.
   - This single-prompt approach performs poorly on unfamiliar image domains, which limits the models' generalization and cross-dataset transfer capabilities.
2. **Introduction of Chain of Thought**:
   - Chain of Thought is a simple yet effective method that approximates the human reasoning process and has proven useful in natural language processing (NLP) tasks.
   - The authors believe that effective reasoning is equally important in visual tasks, and that Chain of Thought can serve as a way to enhance the reasoning capabilities of the models.
3. **Proposed New Method**:
   - The authors propose a novel Chain of Thought prompt tuning method for vision-language modeling.
   - The method designs multiple connected prompts, where each prompt receives information from the previous one and passes it to the next, to simulate the step-by-step human reasoning process.
   - A dynamic chain controller adjusts the weights of the chain based on the input, adapting to different images and task requirements.
   - A set of Meta-Nets is designed, with each Meta-Net generating biases for a specific step to enhance the model's reasoning capabilities.
4. **Experimental Validation**:
   - The authors conducted extensive experiments on various tasks, including image classification, image-text retrieval, and visual question answering.
   - Experimental results show that the method generalizes better in image classification tasks and also performs well in tasks that require more reasoning, such as image-text retrieval and visual question answering.

### Summary

By introducing the Chain of Thought prompt tuning method, this paper addresses the lack of effective reasoning capabilities in existing vision-language models when handling complex tasks. The method improves the model's generalization and cross-dataset transfer capabilities and achieves performance improvements across image classification, image-text retrieval, and visual question answering.
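
The chained-prompt design described above (connected prompts, a dynamic chain controller, and per-step Meta-Nets) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`chain_prompts`, `combine`), the fixed mixing coefficient `mix`, the use of plain linear maps as Meta-Nets, and the softmax controller are all assumptions chosen only to show how information might flow from one step to the next.

```python
import math
import random


def linear(weights, x):
    """Apply a bias-free linear map (a list of rows) to vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Stand-in for learned parameters (hypothetical initialization)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def chain_prompts(image_feat, contexts, meta_nets, controller, mix=0.5):
    """Build one prompt vector per reasoning step plus per-step weights.

    Assumptions (not taken from the paper):
    - each Meta-Net is a linear map producing an image-conditioned bias;
    - a step's prompt is a convex mix of the previous step's prompt and
      its own biased context, with a fixed coefficient `mix`;
    - the chain controller is a linear map followed by a softmax.
    """
    prompts = []
    prev = None
    for ctx, meta in zip(contexts, meta_nets):
        bias = linear(meta, image_feat)               # step-specific bias
        current = [c + b for c, b in zip(ctx, bias)]  # biased context
        if prev is not None:
            # Pass information from the previous step into this one.
            current = [mix * p + (1 - mix) * c for p, c in zip(prev, current)]
        prompts.append(current)
        prev = current
    weights = softmax(linear(controller, image_feat))  # input-dependent weights
    return prompts, weights


def combine(prompts, weights):
    """Weighted sum of the step prompts into a single vector."""
    dim = len(prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, prompts)) for d in range(dim)]


# Toy usage with random stand-ins for learned parameters and image features.
rng = random.Random(0)
dim, steps = 4, 3
image_feat = [rng.uniform(-1, 1) for _ in range(dim)]
contexts = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)]
meta_nets = [random_matrix(dim, dim, rng) for _ in range(steps)]
controller = random_matrix(steps, dim, rng)

prompts, weights = chain_prompts(image_feat, contexts, meta_nets, controller)
final_prompt = combine(prompts, weights)
```

In this sketch, the controller weights depend only on the image features, so two different images would weight the same chain differently. In a real system the contexts, Meta-Nets, and controller would be trained jointly, and the combined prompt would feed a text encoder rather than being used directly.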

Chain of Thought Prompt Tuning in Vision Language Models

Iteratively Prompt Pre-trained Language Models for Chain of Thought

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

Improve Vision Language Model Chain-of-thought Reasoning

Interleaved-Modal Chain-of-Thought

Chain-of-Thought Augmentation with Logit Contrast for Enhanced Reasoning in Language Models

Self-Harmonized Chain of Thought

Chain-Of-Thought Prompting Under Streaming Batch: A Case Study

Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning

Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods

Uncovering Latent Chain of Thought Vectors in Language Models

Supervised Chain of Thought

Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models

An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

Pattern-Aware Chain-of-Thought Prompting in Large Language Models

See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models