Abstract: Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.

What problem does this paper attempt to address?

The paper addresses the problem that existing benchmarks are too easy for frontier large multimodal models (LMMs) and therefore can no longer fully evaluate their capabilities. Specifically:

1. **Limitations of Existing Benchmarks**: As LMM performance improves, existing benchmarks gradually lose their ability to differentiate model capabilities. For example, GPT-4 already scores very highly on several widely used benchmarks (such as MGSM, HumanEval, and MMLU), which leaves those benchmarks little headroom for separating stronger models.
2. **Importance of Chart and Graph Interpretation**: Interpreting scientific and mathematical charts is central to many analysis tasks, especially when the underlying data is inaccessible (for example in documents, sketches, or other image types). The ability to understand and reason about such charts is therefore important for multimodal models.
3. **Need for New Challenging Benchmarks**: Evaluating frontier multimodal models requires a new, more challenging benchmark. Chart and graph analysis is a natural candidate, since its tasks typically include estimating means, intercepts, and correlations.

To this end, the authors introduce **GRAB (Graph Analysis Benchmark)**, a fully synthetic chart-analysis benchmark that contains 2,170 questions covering four core tasks and 23 chart properties. GRAB aims to provide a challenging evaluation platform for current and future frontier multimodal models. Experimental results show that even the best-performing model reaches an accuracy of only 21.7% on GRAB, which indicates that the benchmark is highly challenging.

### Main Contributions

1. **Introduction of GRAB**: A chart-analysis benchmark containing 2,170 questions.
2. **Comprehensive Evaluation of 20 Frontier Multimodal Models**: A detailed evaluation of these models on GRAB.
3. **Insights into Model Strengths and Limitations**: Multiple ablation experiments reveal how the models perform across tasks and categories.

### Key Tasks Involved

GRAB covers the following four core tasks:

- **Properties**: Analyze the characteristics of a single function or data series.
- **Functions**: Calculate the mean of a property across multiple functions.
- **Series**: Estimate the mean of a specific property across multiple data series.
- **Transforms**: Determine the properties of a function after a series of transformations.

Together, these tasks give the benchmark diversity and complexity, enabling a more comprehensive evaluation of the chart-analysis capabilities of multimodal models.
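To make the idea of a synthetic, noise-free question concrete, here is a minimal Python sketch of how items resembling the Properties and Functions tasks could be generated with exactly known answers. It is not the authors' generation pipeline: the function names (`make_series`, `pearson`, `properties_question`, `functions_question`), the parameter ranges, the question wording, and the rounding to one decimal place are all assumptions made for illustration, and rendering the data to an image is omitted.

```python
import random
import statistics


def make_series(rng, n_points=20, slope=1.0, intercept=0.0, noise=0.5):
    """Sample a noisy linear data series (all parameters are illustrative)."""
    xs = [rng.uniform(-5.0, 5.0) for _ in range(n_points)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def properties_question(rng):
    """Hypothetical 'Properties'-style item: one series, one property."""
    xs, ys = make_series(rng, slope=rng.choice([-2.0, -1.0, 1.0, 2.0]))
    return {
        "question": "Estimate the mean of the y values of the plotted series.",
        "data": (xs, ys),
        # The answer is computed from the generating data, so it is exact.
        "answer": round(statistics.fmean(ys), 1),
    }


def functions_question(rng, n_functions=3):
    """Hypothetical 'Functions'-style item: average a property over functions."""
    # Each function is y = m*x + c with integer coefficients (m, c).
    lines = [(rng.randint(-3, 3), rng.randint(-5, 5)) for _ in range(n_functions)]
    mean_intercept = statistics.fmean(c for _, c in lines)
    return {
        "question": "What is the mean y-intercept of the plotted functions?",
        "functions": lines,
        "answer": round(mean_intercept, 1),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the generated items are reproducible
    print(properties_question(rng)["answer"])
    print(functions_question(rng))
```

Because each answer is derived directly from the parameters or samples that would be plotted, no human annotation is involved, which is the sense in which a fully synthetic benchmark can avoid label noise.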

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
