MSG-Chart: Multimodal Scene Graph for ChartQA

Yue Dai, Soyeon Caren Han, Wei Liu

DOI: https://doi.org/10.1145/3627673.3679967

2024-08-09

Abstract: Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart. This graph module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of charts' elements' structure and semantics, thereby improving performance on publicly available benchmarks, ChartQA and OpenCQA.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the difficulty of automatic chart question answering (ChartQA). Chart elements are distributed in complex layouts, and the patterns of the underlying data are not explicitly displayed in the chart, which makes the task hard. The paper points out that although existing models can handle some basic data extraction tasks, they perform poorly on questions that require understanding visual attributes (such as color) or complex logical reasoning. Existing methods also fall short in capturing the spatial and semantic relationships between elements within the chart, so they cannot fully understand the chart's structural and semantic information.

To solve these problems, the authors propose a joint multimodal scene graph, consisting of a visual graph and a textual graph, that explicitly represents the relationships between chart elements and their patterns. This lets the model better capture the structural and semantic knowledge of the chart, thereby improving its performance on the public benchmarks ChartQA and OpenCQA. Specifically, the approach aims to:

1. **Capture structural information**: Capture the spatial relationships between the elements of the chart through the visual graph.
2. **Capture semantic information**: Capture the semantic relationships between the elements of the chart through the textual graph.
3. **Enhance model performance**: Integrate the multimodal scene graph module with different vision transformers as an inductive bias, improving the model's chart understanding and its results on chart question-answering benchmarks.
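The split between a visual graph (spatial relationships) and a textual graph (semantic relationships) can be illustrated with a small sketch. This is a hypothetical illustration only, not the authors' construction: the `ChartElement` type, the k-nearest-neighbour rule for spatial edges, and the "label sits under a bar" rule for semantic edges are all assumptions made here for the example, and the bounding boxes are made-up values.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class ChartElement:
    """Hypothetical chart element: a detected region with optional text."""
    name: str
    kind: str                                # e.g. "bar" or "x_label"
    box: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    text: str = ""


def center(box):
    """Return the centre point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_visual_graph(elements, k=2):
    """Assumed rule: link each element to its k nearest neighbours
    (by centre distance), giving undirected spatial edges."""
    edges = set()
    for a in elements:
        others = sorted(
            (e for e in elements if e is not a),
            key=lambda e: math.dist(center(a.box), center(e.box)),
        )
        for b in others[:k]:
            edges.add(tuple(sorted((a.name, b.name))))
    return edges


def build_textual_graph(elements):
    """Assumed rule: an axis label 'describes' the bar whose horizontal
    extent contains the label's centre, giving directed semantic edges."""
    bars = [e for e in elements if e.kind == "bar"]
    labels = [e for e in elements if e.kind == "x_label"]
    edges = set()
    for label in labels:
        cx, _ = center(label.box)
        for bar in bars:
            if bar.box[0] <= cx <= bar.box[2]:
                edges.add((label.name, "describes", bar.name))
    return edges


# Made-up toy chart: two bars with one x-axis label under each.
elements = [
    ChartElement("bar_A", "bar", (10, 60, 30, 100)),
    ChartElement("bar_B", "bar", (50, 40, 70, 100)),
    ChartElement("label_A", "x_label", (10, 104, 30, 114), text="2019"),
    ChartElement("label_B", "x_label", (50, 104, 70, 114), text="2020"),
]

visual_edges = build_visual_graph(elements, k=2)
textual_edges = build_textual_graph(elements)
print(sorted(visual_edges))
print(sorted(textual_edges))
```

In this toy setup the two edge sets differ: the spatial edges also connect neighbouring bars and neighbouring labels, while the semantic edges only pair each label with the bar it names. A real system would obtain the elements from a detector or OCR step and feed both graphs to a learned model, which this sketch does not attempt.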

mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

Advancing Chart Question Answering with Robust Chart Component Recognition

Understanding the Role of Scene Graphs in Visual Question Answering

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

GoT-CQA: Graph-of-Thought Guided Compositional Reasoning for Chart Question Answering

Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding

VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning

An Empirical Study on Leveraging Scene Graphs for Visual Question Answering

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

Chart Understanding with Large Language Model

CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding

StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

Chart Question Answering: State of the Art and Future Directions

Multimodal Graph Transformer for Multimodal Question Answering

OpenCQA: Open-ended Question Answering with Charts

Enhancing Question Answering on Charts Through Effective Pre-training Tasks