GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts, Kai Han, Samuel Albanie
2024-08-30
Abstract: Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper argues that existing benchmarks leave too little headroom to fully evaluate cutting-edge large multimodal models (LMMs). Specifically:

1. **Limitations of Existing Benchmarks**: As LMMs improve, existing benchmarks gradually lose their ability to differentiate model capabilities. For example, GPT-4 already scores very high on some widely used benchmarks (such as MGSM, HumanEval, and MMLU), which leaves little room to measure further progress.
2. **Importance of Chart and Graph Interpretation**: Interpreting scientific and mathematical charts is at the core of many analysis tasks, especially when the underlying data is inaccessible (as in documents, sketches, or other image types). The ability to understand and reason about these charts is therefore important for multimodal models.
3. **Need for New Challenging Benchmarks**: Addressing these issues requires a new, more challenging benchmark for cutting-edge multimodal models. For chart and graph analysis, the relevant tasks include estimating the means, intercepts, and correlations of functions and data series.

To this end, the authors introduce **GRAB (GRaph Analysis Benchmark)**, a fully synthetic graph analysis benchmark that contains 2,170 questions covering four core tasks and 23 graph properties. GRAB aims to provide a challenging evaluation platform for current and future frontier multimodal models. Experimental results show that even the best-performing model reaches an accuracy of only 21.7% on GRAB, indicating that the benchmark is highly challenging.

### Main Contributions

1. **Introduction of GRAB**: A graph analysis benchmark containing 2,170 questions.
2. **Comprehensive Evaluation of 20 Cutting-Edge Multimodal Models**: A detailed evaluation of these models on GRAB.
3. **Insights into Model Strengths and Limitations**: Multiple ablation experiments that reveal how models perform across different tasks and categories.

### Key Tasks Involved

GRAB covers the following four core tasks:

- **Properties**: Analyze the characteristics of a single function or data series.
- **Functions**: Calculate the average properties of multiple functions.
- **Series**: Estimate the means of specific properties of multiple data series.
- **Transforms**: Determine the function properties after a series of transformations.

Together, these tasks give the benchmark diversity and complexity, enabling a more comprehensive evaluation of the graph analysis capabilities of multimodal models.
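To make the idea of a synthetic, noise-free question concrete, the following is a minimal sketch of how ground-truth answers for the Properties and Functions tasks might be derived directly from the parameters used to draw a plot. It is not the authors' generator: the `Question` class, the helper names, the restriction to straight lines, and the exact-match scoring rule are all assumptions made for illustration, and the rendering step is omitted.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Question:
    """Hypothetical container for one synthetic question."""
    task: str
    prompt: str
    lines: list      # (gradient, intercept) pairs a renderer would plot
    answer: float    # ground truth computed from the parameters


def make_properties_question(rng):
    """Assumed 'Properties' example: one line, ask for its y-intercept."""
    line = (rng.randint(-5, 5), rng.randint(-10, 10))
    prompt = "What is the y-intercept of the plotted line?"
    return Question("properties", prompt, [line], float(line[1]))


def make_functions_question(rng, n_functions=3):
    """Assumed 'Functions' example: several lines, ask for the mean gradient."""
    lines = [(rng.randint(-5, 5), rng.randint(-10, 10))
             for _ in range(n_functions)]
    prompt = "What is the mean gradient of the plotted lines?"
    answer = float(mean(g for g, _ in lines))
    return Question("functions", prompt, lines, answer)


def accuracy(questions, predictions, tol=1e-9):
    """Fraction of predictions matching the ground truth (assumed scoring)."""
    correct = sum(abs(q.answer - p) <= tol
                  for q, p in zip(questions, predictions))
    return correct / len(questions)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the question set is reproducible
    questions = [make_properties_question(rng), make_functions_question(rng)]
    for q in questions:
        print(q.task, q.lines, q.answer)
```

Because each answer is computed from the same parameters that define the plot, no manual annotation is needed, which is one way a fully synthetic benchmark can avoid label noise.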