Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

What problem does this paper attempt to address?

This paper addresses two shortcomings of current Multi-modal Chain-of-Thought (MCoT) benchmarks:

1. **Missing visual operations**: Traditional MCoT benchmarks accept multi-modal input, but their output is limited to text. A text-only output cannot carry out concrete visual operations, which limits model performance on complex visual tasks. For example, during reasoning a traditional method can describe an operation (such as "marking angles") but cannot actually perform the image-processing step, so the reasoning process is incomplete.
2. **Ambiguous expressions**: Text is limited in how well it can convey the conditions of visual reasoning. For example, a description such as "∠1 = 40°" is imprecise without actual labeling: it does not show which angle the measured value belongs to, so the expression is ambiguous and visual information is lost.

To address these problems, the paper introduces a new multi-modal chain-of-thought benchmark, **Chain of Multi-modal Thought (CoMT)**. Its main features are:

- **Multi-modal input and output**: CoMT requires the model not only to handle multi-modal input but also to generate multi-modal reasoning output, simulating the way humans naturally combine visual and textual thinking when they reason.
- **Four-category evaluation**: CoMT contains four categories:
  - **Visual Creation**: Evaluates the ability to generate images from text descriptions.
  - **Visual Deletion**: Evaluates the ability to remove irrelevant information from a given image.
  - **Visual Update**: Evaluates the ability to integrate and update an image while retaining previous information.
  - **Visual Selection**: Evaluates the ability to select specific visual features for image comparison.

Through these changes, CoMT aims to evaluate more comprehensively how well large vision-language models (LVLMs) handle complex visual operations and concise expression, thereby promoting the development of multi-modal reasoning.
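
To make the contrast between a text-only rationale and a multi-modal one concrete, the following is a minimal sketch of how a reasoning chain with interleaved text and images could be represented. It is not the CoMT dataset schema: the class names, field names, and file paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Category(Enum):
    """The four CoMT categories named in the summary above."""
    VISUAL_CREATION = "visual_creation"
    VISUAL_DELETION = "visual_deletion"
    VISUAL_UPDATE = "visual_update"
    VISUAL_SELECTION = "visual_selection"


@dataclass
class ReasoningStep:
    """One step of a rationale (hypothetical structure).

    `image_path` is None for a text-only step and points to an image
    when the step carries out a visual operation.
    """
    text: str
    image_path: Optional[str] = None


@dataclass
class Sample:
    """A hypothetical benchmark item with a multi-modal rationale."""
    category: Category
    question: str
    input_images: List[str]
    rationale: List[ReasoningStep] = field(default_factory=list)
    answer: str = ""


def uses_visual_operations(sample: Sample) -> bool:
    """Return True if at least one reasoning step includes an image."""
    return any(step.image_path is not None for step in sample.rationale)


# Illustrative item: the angle is marked on the figure itself rather
# than only being described in text. All paths are placeholders.
sample = Sample(
    category=Category.VISUAL_UPDATE,
    question="What is the measure of angle 2?",
    input_images=["figure.png"],
    rationale=[
        ReasoningStep("Mark angle 1 as 40 degrees on the figure.",
                      image_path="figure_marked.png"),
        ReasoningStep("Use the marked figure to determine angle 2."),
    ],
)

text_only = Sample(
    category=Category.VISUAL_UPDATE,
    question="What is the measure of angle 2?",
    input_images=["figure.png"],
    rationale=[ReasoningStep("Angle 1 = 40 degrees, so derive angle 2.")],
)
```

Under these assumptions, `uses_visual_operations(sample)` is true for the first item and false for `text_only`, which mirrors the traditional MCoT setting in which every reasoning step is text.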

CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Multimodal Chain-of-Thought Reasoning in Language Models

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration

Interleaved-Modal Chain-of-Thought

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

CONSTRUCTURE: Benchmarking CONcept STRUCTUre REasoning for Multimodal Large Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Chain of Images for Intuitively Reasoning

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Compositional Chain-of-Thought Prompting for Large Multimodal Models

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models