First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

Enming Zhang, Ruobing Yao, Huanyong Liu, Junhui Yu, Jiale Wang

2024-06-18

Abstract: As Multimodal Large Language Models (MLLMs) have developed, their general capabilities have grown increasingly powerful, and numerous evaluation systems have emerged to assess their various abilities. However, there is still no comprehensive method for evaluating MLLMs on flowchart-related tasks, which are very important in daily life and work. We propose FlowCE, the first comprehensive method for assessing MLLMs across multiple dimensions of flowchart-related tasks. It evaluates MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. We find that even GPT4o achieves a score of only 56.63, and among open-source models, Phi-3-Vision obtains the highest score, 49.97. We hope that FlowCE can contribute to future research on MLLMs for flowchart-based tasks. https://github.com/360AILAB-NLP/FlowCE

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the lack of a comprehensive method for evaluating how well current Multimodal Large Language Models (MLLMs) understand flowcharts. Existing evaluation systems measure MLLM capabilities from various perspectives, but none comprehensively assesses flowchart understanding in real-world scenarios. According to the paper, this gap hinders the development of methods for using MLLMs to understand and analyze flowcharts in an open environment. To close it, the paper proposes FlowCE, which it presents as the first benchmark to comprehensively evaluate MLLMs' flowchart understanding in real-world scenarios. FlowCE covers five evaluation dimensions: Reasoning, Information Extraction, Localization Recognition, Summarization, and Logical Verification. It evaluates model performance on these tasks through diverse question-answer pairs. The paper also conducts extensive evaluations of mainstream open-source and proprietary MLLMs and, through detailed performance analysis, identifies the advantages and limitations of these models in understanding flowcharts, providing improvement suggestions for future research and development.

FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding

Visualization Literacy of Multimodal Large Language Models: A Comparative Study

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

A Survey on Evaluation of Multimodal Large Language Models

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

Exploring Graph Structure Comprehension Ability of Multimodal Large Language Models: Case Studies

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

CMMLU: Measuring massive multitask language understanding in Chinese

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

Chart Understanding with Large Language Model

Needle In A Multimodal Haystack

ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models