MSG-Chart: Multimodal Scene Graph for ChartQA

Yue Dai, Soyeon Caren Han, Wei Liu
DOI: https://doi.org/10.1145/3627673.3679967
2024-08-09
Abstract: Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart. This graph module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of charts' elements' structure and semantics, thereby improving performance on publicly available benchmarks, ChartQA and OpenCQA.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty of automatic chart question answering (ChartQA). Chart elements are laid out in complex ways, and the patterns of the underlying data are not explicitly displayed in the chart, which makes the task hard. The paper points out that although existing models can handle some basic data extraction tasks, they perform poorly on questions that require understanding visual attributes (such as color) or complex logical reasoning. Existing methods also fall short in capturing the spatial and semantic relationships between chart elements, so they cannot fully understand a chart's structural and semantic information.

To solve these problems, the authors propose a joint multimodal scene graph, consisting of a visual graph and a textual graph, that explicitly represents the relationships between chart elements and their patterns. In this way, the model can better capture the structural and semantic knowledge of the chart. Specifically, the approach aims to:

1. **Capture structural information**: capture the spatial relationships between chart elements through the visual graph.
2. **Capture semantic information**: capture the semantic relationships between chart elements through the textual graph.
3. **Enhance model performance**: integrate the multimodal scene graph module with different vision transformers as an inductive bias, improving chart understanding and, in turn, performance on the public ChartQA and OpenCQA benchmarks.
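
To make the two-graph idea more concrete, here is a minimal Python sketch of how a visual graph and a textual graph over the same chart elements might be represented. This summary does not describe how the paper actually constructs nodes or edges, so everything below is an assumption made for illustration: the `ChartElement` schema, the `max_distance` threshold, and both edge rules (center distance for the visual graph, shared text for the textual graph) are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class ChartElement:
    """A detected chart element (hypothetical schema, not from the paper)."""
    name: str                               # unique node identifier
    kind: str                               # e.g. "bar", "legend"
    text: str                               # associated text label, "" if none
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    @property
    def center(self) -> Tuple[float, float]:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2, (y0 + y1) / 2)


Edge = FrozenSet[str]  # undirected edge between two element names


def build_visual_graph(elements: List[ChartElement], max_distance: float) -> Set[Edge]:
    """Assumed rule: link elements whose box centers are within max_distance."""
    edges: Set[Edge] = set()
    for a, b in combinations(elements, 2):
        (ax, ay), (bx, by) = a.center, b.center
        if hypot(ax - bx, ay - by) <= max_distance:
            edges.add(frozenset((a.name, b.name)))
    return edges


def build_textual_graph(elements: List[ChartElement]) -> Set[Edge]:
    """Assumed rule: link elements that share the same non-empty text label."""
    edges: Set[Edge] = set()
    for a, b in combinations(elements, 2):
        if a.text and a.text == b.text:
            edges.add(frozenset((a.name, b.name)))
    return edges


# Toy chart: two neighboring bars of one series and a distant legend entry.
elements = [
    ChartElement("bar_2019", "bar", "Sales", (10, 20, 20, 100)),
    ChartElement("bar_2020", "bar", "Sales", (30, 40, 40, 100)),
    ChartElement("legend_sales", "legend", "Sales", (200, 10, 240, 20)),
]
visual_edges = build_visual_graph(elements, max_distance=50)
textual_edges = build_textual_graph(elements)
```

In this toy example the visual graph links only the two adjacent bars, while the textual graph also ties both bars to the far-away legend entry that names their series. That illustrates why the two views can be complementary: spatial proximity alone would miss the bar-to-legend relationship that a question about "Sales" depends on.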