CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen

2024-06-27

Abstract: Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the lack of realistic evaluation of chart understanding in current Multimodal Large Language Models (MLLMs). Existing benchmark datasets often contain oversimplified, homogeneous charts paired with template-based questions, which leads to an over-optimistic measure of progress. The authors show that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can reduce performance by up to 34.5%. To address this, the paper proposes CharXiv, a comprehensive evaluation suite of 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: descriptive questions that examine basic chart elements, and reasoning questions that require synthesizing information across complex visual elements in the chart. All charts and questions are handpicked, curated, and verified by human experts. On the reasoning questions, the strongest proprietary model (GPT-4o) reaches 47.1% accuracy and the strongest open-source model (InternVL Chat V1.5) reaches 29.2%, while human performance is 80.5%. The paper aims to provide a more realistic and faithful measure of progress to support future research on MLLM chart understanding.
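To make the size of these gaps concrete, the short Python sketch below subtracts the three accuracy figures quoted in the abstract. The variable and function names are illustrative only and are not taken from the CharXiv codebase; the numbers are the only inputs drawn from the paper.

```python
# Accuracy (%) figures quoted in the abstract; human performance is 80.5%.
# The dictionary and helper below are illustrative names, not CharXiv code.
reported_accuracy = {
    "Human": 80.5,
    "GPT-4o": 47.1,
    "InternVL Chat V1.5": 29.2,
}


def gap(better: str, worse: str) -> float:
    """Return the accuracy difference in percentage points, rounded to 0.1."""
    return round(reported_accuracy[better] - reported_accuracy[worse], 1)


proprietary_vs_open = gap("GPT-4o", "InternVL Chat V1.5")
human_vs_proprietary = gap("Human", "GPT-4o")
human_vs_open = gap("Human", "InternVL Chat V1.5")

print(f"GPT-4o vs. InternVL Chat V1.5: {proprietary_vs_open} points")
print(f"Human vs. GPT-4o: {human_vs_proprietary} points")
print(f"Human vs. InternVL Chat V1.5: {human_vs_open} points")
```

Under these figures, the strongest proprietary model leads the strongest open-source model by 17.9 percentage points, and still trails human performance by 33.4 points.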

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

Chart Understanding with Large Language Model

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

Distill Visual Chart Reasoning Ability from LLMs to MLLMs

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding

CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning

ChartGPT: Leveraging LLMs to Generate Charts from Abstract Natural Language

Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

On Pre-training of Multimodal Language Models Customized for Chart Understanding

Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding

ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules