Abstract: Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach has limitations because these vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of our proposed method on multi-modal ScienceQA and machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in language models, enhancing their ability to tackle complex real-world problems.

What problem does this paper attempt to address?

This paper addresses the limitations of existing chain-of-thought (CoT) reasoning methods when handling multi-modal information. Specifically, existing multi-modal CoT models rely on fixed image features extracted by pre-trained vision models. These features cannot adapt to the text query, and the vision models are not optimized for complex reasoning tasks, so the generated reasoning chains (rationales) and answers can be inaccurate or unreasonable. To address this, the authors propose Diffusion Process enhanced Multi-Modal CoT (DPMM-CoT). The method uses a diffusion process to generate image features in the latent space that align with language thoughts, which fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT.

### Main contributions:

1. **Multi-modal latent space learning**: Uses the diffusion process to generate image features in the latent space that align with language thoughts, addressing the mismatch between image features and text queries in existing methods.
2. **Enhanced complex reasoning ability**: Deeply fuses image features and text representations, improving the complex reasoning ability of multi-modal CoT, especially on tasks that require combining text and image information.
3. **Experimental verification**: Experiments on ScienceQA and multi-modal machine translation demonstrate the effectiveness of the proposed method, with state-of-the-art performance on ScienceQA.

### Method overview:

1. **Text encoder**: A Transformer model encodes the text into text representations.
2. **Image feature extraction**: A variational auto-encoder (VAE) extracts the latent vector of the image, and the diffusion process adds noise to it to produce image features with deep semantics.
3. **Feature fusion**: An attention mechanism and a gating mechanism deeply fuse the image features with the text representations to produce the final multi-modal representation.
4. **Text decoder**: Generates the reasoning chain and the final answer from the fused multi-modal representation.

### Experimental results:

- **ScienceQA dataset**: Achieved significantly better performance than existing methods on multiple subtasks, especially those that require combining text and image information.
- **Multi-modal machine translation task**: Also achieved a significant performance improvement on the Multi30K dataset, supporting the generality and effectiveness of the method.

In conclusion, this paper addresses the limitations of existing multi-modal CoT methods by introducing diffusion-process-based multi-modal latent space learning, and it significantly improves performance on multi-modal reasoning tasks.
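The image-feature and fusion steps from the method overview can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary only states that noise is added through a diffusion process and that attention plus gating fuses the two modalities. The sketch assumes the standard DDPM forward-noising formula, single-head cross-attention from text tokens to image latents, and a sigmoid gate. The function names `forward_diffuse` and `gated_fusion`, the linear beta schedule, the shared feature dimension, and the random stand-ins for VAE latents and text encodings are all hypothetical.

```python
import numpy as np

def forward_diffuse(z0, t, betas, rng):
    """Assumed DDPM-style forward step:
    z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(h_text, h_img):
    """Each text token attends over the image features (single head)."""
    d = h_text.shape[-1]
    scores = h_text @ h_img.T / np.sqrt(d)   # (n_text, n_img)
    return softmax(scores) @ h_img           # (n_text, d)

def gated_fusion(h_text, h_img, w_text, w_attn):
    """Assumed gate: a sigmoid mixes the text features with the
    image-attended features, element by element."""
    h_attn = cross_attend(h_text, h_img)
    gate = 1.0 / (1.0 + np.exp(-(h_text @ w_text + h_attn @ w_attn)))
    return (1.0 - gate) * h_text + gate * h_attn

rng = np.random.default_rng(0)
d = 8                                        # shared feature size (assumption)
z0 = rng.standard_normal((4, d))             # stand-in for VAE latents of 4 patches
betas = np.linspace(1e-4, 0.02, 1000)        # hypothetical linear noise schedule
z_t = forward_diffuse(z0, t=10, betas=betas, rng=rng)

h_text = rng.standard_normal((6, d))         # stand-in for 6 encoded text tokens
w_text = 0.1 * rng.standard_normal((d, d))   # gate weights (random, untrained)
w_attn = 0.1 * rng.standard_normal((d, d))
fused = gated_fusion(h_text, z_t, w_text, w_attn)
print(fused.shape)  # (6, 8)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the text feature and the image-attended feature, so the text signal is never fully overwritten. In a trained model the gate weights and any projection that matches image and text dimensions would be learned; here they are random placeholders.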

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

Multimodal Chain-of-Thought Reasoning in Language Models

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering

T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning Via Large Language Model Signals for Science Question Answering.

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

Towards a Unified Multimodal Reasoning Framework

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models

Training Large Language Models to Reason in a Continuous Latent Space

CoQ:AN Empirical Framework for Multi-hop Question Answering Empowered by Large Language Models

Chain of Images for Intuitively Reasoning

ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models

CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine Chain-of-Thought Prompting for Multi-domain NLU Tasks

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings