MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu
2024-04-15
Abstract: With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (MMC-Benchmark), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses key issues in chart image understanding. Large language models (LLMs) and large multimodal models (LMMs) have made significant progress on zero-shot, user-oriented vision-language tasks, yet a noticeable gap remains for chart images. Charts contain abstract elements, such as trend lines and color-coded legends, that convey specific data information, and existing multimodal models interpret this content poorly. Specifically, the paper attempts to solve the following problems:

1. **Insufficient chart understanding capability of existing models**: Current open-source multimodal models handle natural scene images well but perform poorly when understanding and interpreting chart images, mainly because charts differ significantly from natural scenes in structure and in how they express information.
2. **Lack of large-scale, diverse chart understanding datasets**: Existing chart understanding datasets are small in scale and lack diversity, making it difficult to adequately train and evaluate multimodal models on chart understanding tasks.
3. **Lack of comprehensive evaluation benchmarks**: No existing benchmark fully assesses the performance of multimodal models on chart understanding, especially across multiple subtasks.

### Solutions

To address these issues, the paper proposes the following solutions:

1. **Constructing a large-scale multimodal chart instruction dataset (MMC-Instruction)**: This dataset contains 600,000 instances, supports various tasks and chart types, and aims to improve the chart understanding capability of multimodal models through large-scale instruction tuning.
2. **Developing a multimodal chart assistant (MMCA)**: MMCA is a new multimodal model built on mPLUG-Owl and fine-tuned on the MMC-Instruction dataset. It achieves state-of-the-art performance on existing chart question-answering benchmarks.
3. **Proposing a multimodal chart benchmark (MMC-Benchmark)**: This is a comprehensive, manually annotated benchmark containing nine different tasks, used to evaluate the reasoning ability of multimodal models in chart understanding. The benchmark provides two quantitative evaluation methods: one assesses free-form generation, and the other assesses chart understanding through multiple-choice questions.

Through these methods, the paper not only improves the performance of multimodal models on chart understanding tasks but also provides data and evaluation tools for future research and development.
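The multiple-choice evaluation mentioned for MMC-Benchmark can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `MCQItem` structure, the `extract_choice` parsing rule, and the `accuracy` helper are all hypothetical names introduced here, and the sketch simply assumes that a model response begins with its chosen option letter.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MCQItem:
    """Hypothetical multiple-choice item; not the benchmark's actual schema."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter


def extract_choice(response: str, valid: List[str]) -> Optional[str]:
    """Return the option letter that opens the response, or None.

    Assumption: the model answers in a form such as "B", "B. ..." or "(b)".
    A response that starts with the English article "A" would be misread
    as option A, so a real evaluator would need a stricter rule.
    """
    match = re.match(r"^\W*([A-Za-z])\b", response.strip())
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid else None


def accuracy(items: List[MCQItem], responses: List[str]) -> float:
    """Fraction of items whose parsed choice equals the gold answer."""
    if not items:
        return 0.0
    correct = 0
    for item, response in zip(items, responses):
        choice = extract_choice(response, list(item.options))
        # Unparseable responses count as incorrect.
        if choice == item.answer:
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    demo_items = [
        MCQItem("Which series grows fastest?", {"A": "Blue", "B": "Red"}, "B"),
        MCQItem("Which year has the peak?", {"A": "2019", "B": "2020"}, "A"),
    ]
    demo_responses = ["B. Red", "The answer is A"]
    print(accuracy(demo_items, demo_responses))
```

Counting unparseable responses as wrong is a deliberately conservative choice in this sketch. Free-form generation needs a different scoring approach and is not covered here.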