Abstract: Large Multimodal Models (LMMs) have achieved impressive success in visual understanding and reasoning, remarkably improving the performance of mathematical reasoning in a visual context. Yet, a challenging type of visual math lies in the multimodal graph theory problem, which demands that LMMs understand the graphical structures accurately and perform multi-step reasoning on the visual graph. Additionally, exploring multimodal graph theory problems will lead to more effective strategies in fields like biology, transportation, and robotics planning. To step forward in this direction, we are the first to design a benchmark named VisionGraph, used to explore the capabilities of advanced LMMs in solving multimodal graph theory problems. It encompasses eight complex graph problem tasks, from connectivity to shortest path problems. Subsequently, we present a Description-Program-Reasoning (DPR) chain to enhance the logical accuracy of reasoning processes through graphical structure description generation and algorithm-aware multi-step reasoning. Our extensive study shows that 1) GPT-4V outperforms Gemini Pro in multi-step graph reasoning; 2) All LMMs exhibit inferior perception accuracy for graphical structures, whether in zero/few-shot settings or with supervised fine-tuning (SFT), which further affects problem-solving performance; 3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs and the GPT-4V (DPR) agent achieves SOTA performance.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
### What problem does this paper attempt to address?
This paper examines whether large multimodal models (LMMs) can solve graph-theory problems presented in a visual context. Specifically, it focuses on the following points:
1. **Graph Structure Comprehension Ability**: Unlike ordinary images, visual graphs have a strong spatial structure, which makes them well suited to testing the spatial comprehension of LMMs. The paper examines how well LMMs identify nodes and edges.
2. **Influence of Supervised Fine-Tuning Methods**: LMMs are usually fine-tuned on open-domain image-text data and perform poorly on vertical-domain images (such as medical images). The paper therefore further fine-tunes LMMs on its constructed graph instruction-tuning data and analyzes the effect of the training strategy.
3. **Analysis of Multistep Graph Reasoning Ability**: GPT-4V performs well in complex vision-language reasoning scenarios, so the paper tests whether it can also solve multimodal graph problems through multistep reasoning. To this end, the paper proposes a Description-Program-Reasoning (DPR) method, which aims to make the reasoning process more logically accurate.
### Main Contributions
1. **Propose the Multimodal Graph-Theory Problem Benchmark VisionGraph**: It evaluates the graph structure comprehension and multistep reasoning abilities of LMMs when solving graph-theory problems in a visual context. To support future research on graph-theory problems, the paper will release the VisionGraph benchmark and its advanced prompting techniques.
2. **Empirical Research Reveals the Shortcomings of LMMs**: LMMs, including GPT-4V and Gemini, fall short in understanding graph structures and in multistep multimodal reasoning, which leaves room to improve their multistep reasoning and planning abilities on visual graphs.
3. **Design the Graph-Problem-Solving Method DPR**: It improves the multistep reasoning performance of LMMs by interleaving natural language and programming. The resulting GPT-4V (DPR) is a comprehensive multimodal agent that integrates complex task decomposition, small-model perception enhancement, code generation, and tool invocation.
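The description-then-program flow of DPR can be illustrated with a minimal sketch. This is not the paper's implementation: the textual description format (`Nodes: ... Edges: (u, v), ...`) and the helper names `parse_description` and `is_connected` are assumptions made for illustration. In the actual method, the description would come from the LMM and the program would be generated by it rather than written by hand.

```python
import re
from collections import deque


def parse_description(description: str) -> dict[int, set[int]]:
    """Turn a textual graph description into an undirected adjacency map.

    The "Nodes: ... Edges: (u, v)" format is a hypothetical stand-in for
    whatever description the model produces.
    """
    nodes_part, _, edges_part = description.partition("Edges:")
    adjacency = {int(n): set() for n in re.findall(r"\d+", nodes_part)}
    for u, v in re.findall(r"\((\d+)\s*,\s*(\d+)\)", edges_part):
        u, v = int(u), int(v)
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def is_connected(adjacency: dict[int, set[int]], source: int, target: int) -> bool:
    """Breadth-first search: is there any path from source to target?"""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Step 1 (Description): text that a model might produce for a graph image.
description = "Nodes: 0, 1, 2, 3, 4. Edges: (0, 1), (1, 2), (3, 4)"
# Step 2 (Program): run an explicit algorithm on the parsed structure.
graph = parse_description(description)
# Step 3 (Reasoning): the final answer is grounded in the program's result.
print(is_connected(graph, 0, 2))  # True
print(is_connected(graph, 0, 4))  # False
```

The sketch also shows why perception errors propagate: if the description misses or invents an edge, the program still runs correctly but on the wrong graph.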
### Experimental Setup
1. **Comparison Models**: The paper tests widely used commercial LMMs (such as GPT-4V, Gemini, and Qwen-Plus/Max) as well as open-source LMMs (such as MiniGPT-4, InstructBLIP, LLaVA, and Qwen-VL).
2. **Evaluation Metrics**: Graph-theory problems are evaluated on three different sub-problems, each with its own criteria. For example, node identification is scored by the accuracy of node counting, and edge identification by the correct rate and error rate of the predicted graph edges.
3. **Implementation Details**: Model training is divided into two stages. In the first stage, the model is trained for 5 epochs with the AdamW optimizer. In the second stage, a VQA task with fine-grained edge information is introduced, and the data path and learning rate are adjusted.
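The perception metrics can be sketched as follows. The summary does not give the paper's exact formulas, so the definitions here are assumptions: node accuracy is taken as the fraction of graphs with an exactly correct node count, the edge correct rate as the share of ground-truth edges that were predicted, and the edge error rate as the share of predicted edges that do not exist in the graph. The function names are hypothetical.

```python
def node_count_accuracy(predicted_counts, true_counts):
    """Fraction of graphs whose predicted node count equals the true count."""
    matches = sum(p == t for p, t in zip(predicted_counts, true_counts))
    return matches / len(true_counts)


def normalize(edges):
    """Treat edges as undirected, so (1, 0) and (0, 1) are the same edge."""
    return {tuple(sorted(edge)) for edge in edges}


def edge_rates(predicted_edges, true_edges):
    """Return (correct_rate, error_rate) under the assumed definitions."""
    predicted, truth = normalize(predicted_edges), normalize(true_edges)
    correct_rate = len(predicted & truth) / len(truth) if truth else 0.0
    error_rate = len(predicted - truth) / len(predicted) if predicted else 0.0
    return correct_rate, error_rate


true_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
predicted_edges = [(1, 0), (1, 2), (2, 3), (1, 3)]  # one missed, one invented
correct, error = edge_rates(predicted_edges, true_edges)
print(correct, error)  # 0.75 0.25
```

Reporting both rates separates two failure modes: a model that misses real edges lowers the correct rate, while one that hallucinates edges raises the error rate.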
### Experimental Results
1. **Graph Structure Comprehension Ability**: GPT-4V outperforms Gemini on node identification and edge identification, showing stronger spatial comprehension. However, all LMMs show a relatively high error rate on edge identification, indicating that spatial perception still has room for improvement.
2. **Influence of Supervised Fine-Tuning Methods**: Introducing more graph-understanding data can significantly improve the accuracy of edge identification, especially by reducing the error rate. In addition, the few-shot setting improves the graph perception and reasoning accuracy of GPT-4V.
3. **Improvement of Multistep Reasoning Ability**: The DPR method significantly improves the performance of GPT-4V on multistep graph-reasoning tasks, especially connectivity, cycle, and shortest-path problems.
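For reference, the cycle and shortest-path tasks named above have standard algorithmic solutions, which is the kind of program an algorithm-aware reasoning step could rely on. The sketch below is illustrative only and assumes undirected graphs with non-negative edge weights; the function names and input formats are hypothetical and not taken from the paper.

```python
import heapq


def has_cycle(num_nodes, edges):
    """Union-find check: an undirected graph has a cycle if some edge
    joins two nodes that are already in the same component."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return True
        parent[root_u] = root_v
    return False


def shortest_path(weighted_edges, source, target):
    """Dijkstra's algorithm; returns (distance, path), or (inf, []) if
    the target is unreachable. Assumes non-negative weights."""
    adjacency = {}
    for u, v, w in weighted_edges:
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))
    best = {source: 0}
    previous = {}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            path = [target]
            while path[-1] != source:
                path.append(previous[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf"), []


weighted = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]
print(has_cycle(4, [(u, v) for u, v, _ in weighted]))  # True (0-1-2 triangle)
print(shortest_path(weighted, 0, 3))  # (8, [0, 2, 1, 3])
```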
In conclusion, by proposing the VisionGraph benchmark and the DPR method, this paper provides new tools and methods for research on multimodal graph-theory problems and supports the application and development of LMMs in this field.