Abstract: With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (**MMC-Instruction**) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (**MMCA**), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (**MMC-Benchmark**), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses a gap in chart image understanding. Large language models (LLMs) and large multimodal models (LMMs) have made significant progress on zero-shot, user-oriented vision-language tasks, but they still struggle with charts. Charts contain abstract elements, such as trend lines and color-coded legends, that encode specific data, and existing multimodal models interpret this content poorly.

Specifically, the paper targets three problems:

1. **Insufficient chart understanding capability of existing models**: Current open-source multimodal models handle natural scene images well but perform poorly on chart images, mainly because charts differ significantly from natural scenes in structure and in how they express information.
2. **Lack of large-scale, diverse chart understanding datasets**: Existing chart understanding datasets are small and lack diversity, which makes it difficult to adequately train and evaluate multimodal models on chart understanding tasks.
3. **Lack of comprehensive evaluation benchmarks**: No existing benchmark fully assesses multimodal models on chart understanding, especially across multiple subtasks.

### Solutions

To address these issues, the paper proposes the following:

1. **Constructing a large-scale multimodal chart instruction dataset (MMC-Instruction)**: The dataset contains 600,000 instances covering various tasks and chart types, and is intended to improve the chart understanding capability of multimodal models through large-scale instruction tuning.
2. **Developing a multimodal chart assistant (MMCA)**: MMCA is built on the mPLUG-Owl model and fine-tuned on MMC-Instruction. It achieves state-of-the-art performance on existing chart question-answering benchmarks.
3. **Proposing a multimodal chart benchmark (MMC-Benchmark)**: A comprehensive, manually annotated benchmark with nine distinct tasks for evaluating the reasoning ability of multimodal models over charts. It supports two quantitative evaluation modes: free-form generation and multiple-choice question answering.

Through these contributions, the paper improves the performance of multimodal models on chart understanding tasks and provides data and evaluation tools for future research.
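The multiple-choice evaluation mode mentioned above can be illustrated with a minimal sketch. This is not the paper's evaluation code: the record layout (`task`, `answer`, `prediction`), the function name `mcq_accuracy`, and the task labels are all hypothetical, chosen only to show how per-task and overall accuracy might be computed when a benchmark spans several tasks.

```python
from collections import defaultdict


def mcq_accuracy(records):
    """Compute per-task and overall accuracy for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'task'       - name of the benchmark task the question belongs to
      'answer'     - gold option letter, e.g. "A"
      'prediction' - option letter produced by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[record["task"]] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_task, overall


# Placeholder data; the task names are illustrative, not the benchmark's own.
records = [
    {"task": "task_a", "answer": "A", "prediction": "a "},
    {"task": "task_a", "answer": "B", "prediction": "C"},
    {"task": "task_b", "answer": "D", "prediction": "D"},
]
per_task, overall = mcq_accuracy(records)
```

Reporting accuracy per task as well as overall keeps a strong result on one task from masking weakness on another, which matters for a benchmark that covers nine tasks. Free-form generation is harder to score automatically and is not covered by this sketch.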

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
