CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, Libo Qin
2024-12-17
Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two shortcomings of current Multi-modal Chain-of-Thought (MCoT) benchmarks:

1. **Missing visual operations**: Traditional MCoT benchmarks accept multi-modal input but restrict the output to text. A text-only output can describe a visual operation (such as "marking angles") but cannot actually carry it out on the image, so the reasoning process remains incomplete and the model is limited on complex visual tasks.
2. **Ambiguous expressions**: Text alone conveys visual reasoning conditions poorly. For example, a statement such as "∠1 = 40°" is imprecise unless the angle is actually labeled in the figure, because it does not show which angle the measured value belongs to. The result is ambiguity and a loss of visual information.

To address these problems, the paper introduces a new benchmark, **Chain of Multi-modal Thought (CoMT)**. Its main features are:

- **Multi-modal input and output**: CoMT requires models not only to handle multi-modal input but also to produce multi-modal reasoning output, mimicking the way humans naturally integrate visual and textual thinking while reasoning.
- **Four-category evaluation**: CoMT consists of four categories:
  - **Visual Creation**: Evaluates the ability to generate images from text descriptions.
  - **Visual Deletion**: Evaluates the ability to remove irrelevant information from a given image.
  - **Visual Update**: Evaluates the ability to integrate and update an image while retaining previous information.
  - **Visual Selection**: Evaluates the ability to select specific visual features for image comparison.

Through these changes, CoMT aims to evaluate Large Vision-Language Models (LVLMs) more comprehensively on complex visual operations and concise expression, and thereby to encourage further work on multi-modal reasoning.
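
To make the difference between text-only and multi-modal reasoning output concrete, the following is a minimal Python sketch of how a benchmark sample with interleaved text and image reasoning steps might be represented. It is an illustration only: the class names, field names, file paths, and the example question are all hypothetical and are not taken from the paper or from any released CoMT data format. Only the four category names come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Category(Enum):
    # The four CoMT categories named in the abstract.
    VISUAL_CREATION = "Visual Creation"
    VISUAL_DELETION = "Visual Deletion"
    VISUAL_UPDATE = "Visual Update"
    VISUAL_SELECTION = "Visual Selection"


@dataclass
class ReasoningStep:
    # Hypothetical schema: one step of a rationale.
    text: str
    # Path to an image created or edited at this step; None for a text-only step.
    image_path: Optional[str] = None


@dataclass
class Sample:
    # Hypothetical schema: one benchmark item.
    category: Category
    question: str
    input_images: List[str]
    rationale: List[ReasoningStep] = field(default_factory=list)
    answer: str = ""

    def has_multimodal_rationale(self) -> bool:
        """True if at least one reasoning step carries an image."""
        return any(step.image_path is not None for step in self.rationale)


# A text-only rationale, as in a traditional MCoT benchmark (made-up example).
text_only = Sample(
    category=Category.VISUAL_UPDATE,
    question="What is the measure of angle 2?",
    input_images=["figure.png"],
    rationale=[ReasoningStep("Mark angle 1 as 40 degrees.")],
    answer="140",
)

# The same item with the visual operation actually attached to the step.
multimodal = Sample(
    category=Category.VISUAL_UPDATE,
    question="What is the measure of angle 2?",
    input_images=["figure.png"],
    rationale=[ReasoningStep("Mark angle 1 as 40 degrees.", "figure_marked.png")],
    answer="140",
)
```

In this sketch, `text_only.has_multimodal_rationale()` is false and `multimodal.has_multimodal_rationale()` is true, which mirrors the distinction the summary draws: the first rationale only describes the marking operation, while the second carries the edited image alongside the text.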