Abstract: Charts are common in literature across various scientific fields, conveying rich information easily accessible to readers. Current chart-related tasks focus on either chart perception, which extracts information from visual charts, or chart reasoning given the extracted data, e.g., in a tabular form. In this paper, we introduce StructChart, a novel framework that leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach to chart perception and reasoning tasks. The approach is generally applicable to different downstream tasks, beyond the question-answering task specifically studied in peer works. Specifically, StructChart first reformulates the chart data from the tabular form (linearized CSV) to STR, which helps reduce the task gap between chart perception and reasoning. We then propose a Structuring Chart-oriented Representation Metric (SCRM) to quantitatively evaluate chart perception performance. To augment the training, we further explore the potential of Large Language Models (LLMs) to enhance the diversity of both chart visual style and statistical information. Extensive experiments on various chart-related tasks demonstrate the effectiveness and potential of a unified chart perception-reasoning paradigm to push the frontier of chart understanding.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper, "StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding", aims to solve the following problems:

1. **The gap between perception tasks and reasoning tasks**:
   - Perception tasks focus on extracting information from charts as accurately as possible, while ignoring the subtle relationships between data columns and rows.
   - Reasoning tasks must take complex data relationships into account in order to output answers or summarize chart information, especially for charts that contain both numerical and textual information.
2. **Incomplete evaluation metrics**:
   - There is no comprehensive metric that evaluates chart perception performance from the perspective of structured information extraction.
   - Existing metrics each cover only a single chart type (such as bar, pie, or line charts) and are difficult to generalize to other chart types.
3. **Expensive chart corpora and annotations**:
   - Obtaining chart data from different fields, together with annotations, is labor-intensive, time-consuming, and highly dependent on professionals in those fields.

To solve these problems, the authors propose **StructChart**, a unified and label-efficient learning paradigm for joint perception and reasoning. StructChart includes the following key components:

- **Transformer-based Chart Information Extractor (CIE)**: Combines an image encoder and a text decoder to convert a chart image into text in CSV format.
- **Structured Triplet Representation (STR)**: Restructures the intermediate CSV text into triplet form to express clearly the complex positional relationships between headers and indexes.
- **Structuring Chart-oriented Representation Metric (SCRM)**: A metric that evaluates the quality of the converted triplets, which in turn supports the subsequent reasoning process.
- **Large Language Model (LLM)-based self-inspection data generation scheme**: A new chart data simulation paradigm that enhances perception and reasoning abilities under zero-shot and few-shot conditions by increasing the number of simulated charts.

### Overview of the solution

1. **Perception stage**:
   - CIE uses a pixel-level encoder and a text-level decoder, where the visual encoder is based on ViT. CIE dynamically adjusts the image resolution to maintain a constant number of patches and adds absolute position embeddings to handle images of different resolutions.
   - At this stage, the chart is converted from pixels into text-level linearized CSV tokens (LCT).
2. **Reasoning stage**:
   - Before reasoning, the LCT is converted to the designed STR so that the reasoning module can better understand the chart information.
   - This structuring step makes the relationships between entities within the chart explicit. Considering the difficulty of evaluating downstream tasks, the reasoning process performs QA tasks in a zero-shot manner on various LLMs.
3. **Design of STR**:
   - STR represents the positional relationships between row and column headers in the chart, addressing the problem that the LCT format is sensitive to changes in entity position.
   - STR can be extended to represent higher-order relationships in multi-chart and high-dimensional charts.
4. **Design of SCRM**:
   - SCRM comprehensively evaluates the extracted chart information represented by STR.
   - SCRM evaluates at both the image level and the dataset level, using the edit distance between entities and the relative error between values, and defines three fine-grained tolerance levels to measure the similarity between predicted triplets and ground-truth triplets.
5. **Data augmentation**:
   - PlotAgent, an LLM-based text-to-chart-level data generation scheme, combines statistical data query and plot code generation to ensure diversity in both data and style.

The paper reports that, with these components, StructChart performs well on chart perception and reasoning tasks and substantially improves chart perception under few-shot conditions.
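To make the STR and SCRM ideas above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the triple layout `(row_label, column_header, value)`, the function names, the `|` separator used when comparing entities, and the threshold parameters `max_edit` and `max_rel_err` are all hypothetical, and the paper's actual tolerance levels and scoring formulas are not given in this summary.

```python
import csv
import io


def csv_to_triples(csv_text):
    """Convert CSV text into (row_label, column_header, value) triples.

    Assumption: the first row holds column headers, the first column holds
    row labels, and every other cell parses as a float.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, body = rows[0], rows[1:]
    triples = []
    for row in body:
        row_label = row[0].strip()
        for column_header, cell in zip(header[1:], row[1:]):
            triples.append((row_label, column_header.strip(), float(cell)))
    return triples


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def relative_error(predicted, expected):
    """Relative error of a value; an expected value of zero must match exactly."""
    if expected == 0:
        return 0.0 if predicted == 0 else float("inf")
    return abs(predicted - expected) / abs(expected)


def triple_matches(pred, truth, max_edit, max_rel_err):
    """Hypothetical tolerance check: entity strings within max_edit edits
    and values within max_rel_err relative error."""
    pred_entity = pred[0] + "|" + pred[1]
    truth_entity = truth[0] + "|" + truth[1]
    if edit_distance(pred_entity, truth_entity) > max_edit:
        return False
    return relative_error(pred[2], truth[2]) <= max_rel_err


def matched_fraction(preds, truths, max_edit, max_rel_err):
    """Illustrative per-chart score (not the paper's formula): the share of
    ground-truth triples matched by at least one predicted triple."""
    if not truths:
        return 1.0
    hits = sum(
        any(triple_matches(p, t, max_edit, max_rel_err) for p in preds)
        for t in truths
    )
    return hits / len(truths)


if __name__ == "__main__":
    # Same table with rows and columns reordered: the CSV strings differ,
    # but the sets of triples are identical.
    csv_a = "Year,Sales,Profit\n2020,10,2\n2021,12,3\n"
    csv_b = "Year,Profit,Sales\n2021,3,12\n2020,2,10\n"
    print(set(csv_to_triples(csv_a)) == set(csv_to_triples(csv_b)))
```

Comparing triples as a set is what removes the sensitivity to row and column order that the summary attributes to the linearized CSV format. Calling `triple_matches` with stricter or looser thresholds mimics the idea of multiple tolerance levels, although the specific thresholds here are placeholders.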

StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding

StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules

Rethinking Comprehensive Benchmark for Chart Understanding: A Perspective from Scientific Literature

ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization

ChartBench: A Benchmark for Complex Visual Reasoning in Charts

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

TinyChart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

MSG-Chart: Multimodal Scene Graph for ChartQA

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

ChartAdapter: Large Vision-Language Model for Chart Summarization

AskChart: Universal Chart Understanding through Textual Enhancement

DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Advancing Chart Question Answering with Robust Chart Component Recognition

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models