Abstract: With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making, caused by insufficient visual information and by low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to act as multifaceted experts that derive higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page:

What problem does this paper attempt to address?

The paper aims to address two main issues in multimodal chain-of-thought (CoT) for visual reasoning tasks:

1. **Hallucination in the Decision-Making Stage**: Existing methods typically feed text alone into large language models (LLMs) when planning, without any visual context. The model can therefore misread the problem (the "determining hallucinations" named in the abstract), which degrades the accuracy of all subsequent reasoning. For example, without the image, the model may fail to grasp what the question actually refers to and produce irrelevant decisions.
2. **Insufficient Information Hierarchy in the Execution Stage**: Current multimodal CoT methods rely on external tools during the execution stage, but these tools can often extract only low-level visual information (such as object positions) and struggle to provide the high-level abstract information that complex reasoning needs. This burdens the model with long-text reasoning and complicates the entire process.

To address these issues, the paper proposes a novel multimodal CoT framework named Cantor, which improves performance on visual reasoning tasks by integrating visual information with logical reasoning. Specifically, Cantor introduces visual information during the decision-generation stage so that decisions are grounded in the actual image, and it uses a single multimodal large language model (MLLM) to play multiple expert roles and obtain high-level information directly, thereby enhancing the CoT generation process. Experimental results show that Cantor significantly improves performance on two complex visual reasoning datasets without fine-tuning or ground-truth rationales.
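The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `MLLM` callable, the expert role names, the prompt wording, the `expert: sub-task` line format, and the choice to pass the image to the final answering step are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical model interface: (prompt, image reference or None) -> reply text.
MLLM = Callable[[str, Optional[str]], str]

# Illustrative expert roles; the names and instructions are assumptions.
EXPERT_ROLES: Dict[str, str] = {
    "object_detector": "Locate and count the objects relevant to the task.",
    "text_reader": "Read any text visible in the image.",
    "chart_analyst": "Summarize trends shown in charts or diagrams.",
}


@dataclass
class Decision:
    subtasks: Dict[str, str]  # expert role -> sub-task assigned to it


def generate_decision(mllm: MLLM, question: str, image: str) -> Decision:
    """Decision stage: plan sub-tasks while the model can see the image."""
    prompt = (
        "DECISION\n"
        f"Question: {question}\n"
        f"Available experts: {', '.join(EXPERT_ROLES)}\n"
        "Assign sub-tasks, one per line, formatted as 'expert: sub-task'."
    )
    subtasks: Dict[str, str] = {}
    for line in mllm(prompt, image).splitlines():
        role, sep, task = line.partition(":")
        role, task = role.strip(), task.strip()
        # Keep only well-formed lines that name a known expert.
        if sep and role in EXPERT_ROLES and task:
            subtasks[role] = task
    return Decision(subtasks)


def execute(mllm: MLLM, decision: Decision, image: str) -> Dict[str, str]:
    """Execution stage: the same model answers each sub-task in an expert role."""
    reports: Dict[str, str] = {}
    for role, task in decision.subtasks.items():
        prompt = f"EXPERT {role}\n{EXPERT_ROLES[role]}\nSub-task: {task}"
        reports[role] = mllm(prompt, image)
    return reports


def synthesize(mllm: MLLM, question: str, reports: Dict[str, str], image: str) -> str:
    """Combine the expert reports into a final chain-of-thought answer."""
    evidence = "\n".join(f"- {role}: {text}" for role, text in reports.items())
    prompt = (
        "ANSWER\n"
        f"Question: {question}\n"
        f"Expert reports:\n{evidence}\n"
        "Reason step by step, then answer."
    )
    return mllm(prompt, image)


def run_pipeline(mllm: MLLM, question: str, image: str) -> str:
    decision = generate_decision(mllm, question, image)
    reports = execute(mllm, decision, image)
    return synthesize(mllm, question, reports, image)
```

In this sketch every call receives the image, so planning is not done on text alone, and the experts return summaries in natural language rather than raw tool output. Any function matching the `MLLM` signature, including a stub for testing, can be passed to `run_pipeline`.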

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Multimodal Chain-of-Thought Reasoning in Language Models

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Chain of Images for Intuitively Reasoning

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

Interleaved-Modal Chain-of-Thought

Compositional Chain-of-Thought Prompting for Large Multimodal Models

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

MaTCR: Modality-Aligned Thought Chain Reasoning for Multimodal Task-Oriented Dialogue Generation

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Enhancing Advanced Visual Reasoning Ability of Large Language Models