VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Zhongyu Wei

2024-05-28

Abstract: While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. Additionally, we construct an instruction dataset to facilitate LMMs in adapting to reasoning with VoCoT. By introducing VoCoT into the prevalent open-source LMM architecture, we introduce VolCano. With only 7B parameters and limited input resolution, VolCano demonstrates excellent performance across various scenarios, surpassing SOTA models, including GPT-4V, in tasks requiring complex reasoning. Our code, data and model will be available at https://github.com/RupertLuo/VoCoT.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the limitations of current large multi-modal models (LMMs) on complex tasks, in particular the limited effectiveness that results from the prevailing single-step reasoning paradigm. Specifically:

1. **Enhancing Multi-step Reasoning Ability**: The paper proposes VoCoT, an object-centric Chain-of-Thought (CoT) reasoning framework designed to strengthen the multi-step reasoning ability of LMMs in multi-modal settings.
2. **Visually Grounded Representation**: VoCoT represents each object as a tuple of a text description, coordinates, and the corresponding visual representation, which aligns information across modalities throughout the reasoning path.
3. **Improving Interpretability**: Compared with single-step reasoning, multi-step reasoning exposes the problem-solving process, which makes the model's output easier to interpret.
4. **Reducing Hallucinations**: Visually grounded object representations reduce the likelihood of generating erroneous information during reasoning, which makes the model's inferences more reliable.

In summary, the paper aims to improve the ability of LMMs to handle tasks that require complex reasoning by introducing the VoCoT framework, and to make their outputs more reliable and interpretable.
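The object tuple described above (text description, coordinates, visual representation) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `GroundedObject`, the helper `render_step`, the bracketed coordinate format, and the use of normalized `(x1, y1, x2, y2)` boxes are all assumptions made for illustration, and the `visual` field is only a placeholder for whatever region features a real model would supply.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GroundedObject:
    """Hypothetical container for one object mentioned in a reasoning path."""
    description: str                        # text description, e.g. "red mug"
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in [0, 1]
    visual: List[float] = field(default_factory=list)  # placeholder features

    def __post_init__(self) -> None:
        # Reject boxes that are out of range or have non-positive area.
        x1, y1, x2, y2 = self.box
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            raise ValueError(f"invalid normalized box: {self.box}")

    def as_text(self) -> str:
        """Render the textual part of the tuple (format is illustrative)."""
        coords = ", ".join(f"{v:.2f}" for v in self.box)
        return f"{self.description} [{coords}]"


def render_step(template: str, objects: List[GroundedObject]) -> str:
    """Fill numbered slots such as {0}, {1} with grounded object mentions."""
    return template.format(*(obj.as_text() for obj in objects))


mug = GroundedObject("red mug", (0.10, 0.20, 0.35, 0.60))
table = GroundedObject("wooden table", (0.00, 0.50, 1.00, 1.00))
step = render_step("The {0} is on the {1}.", [mug, table])
print(step)
```

In this sketch every object mention in a reasoning step carries its coordinates alongside its description, so each claim in the step can be checked against a specific image region. A real system would also interleave the visual features into the model's input sequence, which is omitted here.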

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Multimodal Chain-of-Thought Reasoning in Language Models

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Interleaved-Modal Chain-of-Thought

Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

Improve Vision Language Model Chain-of-thought Reasoning

ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models

KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning

Large Language Models are Visual Reasoning Coordinators

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale