Abstract: In this work, we present SciGraphQA, a synthetic multi-turn question-answer dataset related to academic graphs. SciGraphQA is 13 times larger than ChartVQA, the previously largest chart-visual question-answering dataset. It is also the largest open-sourced chart VQA dataset with non-synthetic charts. To build our dataset, we selected 290,000 Computer Science or Machine Learning ArXiv papers published between 2010 and 2020, and then used Palm-2 to generate 295K samples of open-vocabulary multi-turn question-answering dialogues about the graphs. As context, we provided the text-only Palm-2 with the paper title, abstract, paragraph mentioning the graph, and rich text contextual data from the graph itself, obtaining dialogues with an average of 2.23 question-answer turns for each graph. We asked GPT-4 to assess the matching quality of our question-answer turns given the paper's context, obtaining an average rating of 8.7/10 on our 3K test set. We evaluated the 0-shot capability of the most popular MLLM models, such as LLaVA, mPLUG-Owl, BLIP-2, and OpenFlamingo, on our dataset, finding LLaVA-13B to be the most performant with a CIDEr score of 0.08. We further enriched the question prompts for LLaVA by including the serialized data tables extracted from the graphs using the DePlot model, boosting LLaVA's 0-shot CIDEr to 0.15. To verify the validity of our dataset, we also fine-tuned LLaVA using our dataset, reaching a substantially higher CIDEr score of 0.26. We anticipate further accuracy improvement by including segmentation mask tokens and leveraging larger LLM backbones coupled with emergent prompting techniques. Our code and data are open-sourced.
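To make the context-assembly step concrete, here is a minimal Python sketch of how the four text fields named in the abstract could be combined into one prompt for a text-only LLM. The function name, section labels, instruction wording, and `max_turns` parameter are all assumptions made for illustration; the abstract does not specify the prompt the authors actually used.

```python
def build_context_prompt(title: str, abstract: str, paragraph: str,
                         graph_text: str, max_turns: int = 3) -> str:
    """Assemble an illustrative prompt for a text-only LLM.

    All labels and the instruction wording are hypothetical; only the
    four kinds of context come from the abstract.
    """
    sections = [
        ("Paper title", title),
        ("Abstract", abstract),
        ("Paragraph mentioning the graph", paragraph),
        ("Text from the graph", graph_text),
    ]
    # Skip empty fields so the prompt does not contain blank sections.
    body = "\n\n".join(
        f"{name}:\n{text.strip()}" for name, text in sections if text.strip()
    )
    instruction = (
        f"Write a dialogue of up to {max_turns} question-answer turns "
        "about the graph, using only the context above."
    )
    return f"{body}\n\n{instruction}"
```

The resulting string would then be sent to whichever text-only model is generating the dialogues; that call is omitted here because it depends on the client library in use.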

What problem does this paper attempt to address?

This paper addresses the understanding of scientific charts and question answering about them. Specifically, the authors create a large-scale multi-turn question-answering dataset (SciGraphQA) to promote the understanding and interpretation of complex charts in academic literature. With this dataset, they aim to provide a benchmark for multi-modal large language models (MLLMs), evaluate how these models perform on scientific charts, and promote the development of related technologies.

### Main Problems and Solutions

1. **Lack of large-scale scientific chart question-answering datasets**
   - **Problem**: Most existing visual question-answering (VQA) datasets are based on synthetic data or natural images and cannot fully reflect the uniqueness and complexity of scientific charts.
   - **Solution**: The authors constructed SciGraphQA, a large-scale multi-turn question-answering dataset with 295,000 samples, built specifically on real-world academic charts. These charts come from 290,000 arXiv papers in computer science and machine learning.
2. **Limitations of existing models in scientific chart understanding**
   - **Problem**: Current multi-modal large language models (such as LLaVA, mPLUG-Owl, and BLIP-2) perform poorly on scientific charts, especially in the zero-shot setting.
   - **Solution**: The authors evaluated the zero-shot performance of these models on SciGraphQA and also significantly improved performance on scientific chart question answering by fine-tuning the LLaVA-13B model.
3. **Improving the quality and diversity of question-answering dialogues**
   - **Problem**: The generated question-answering dialogues need to be of high quality and diverse to reflect how people actually interact with charts.
   - **Solution**: The authors used Palm-2 to generate multi-turn dialogues and used GPT-4 to rate their quality, checking that the generated dialogues are both accurate and helpful for understanding the chart content.

### Dataset Features

- **Large-scale**: SciGraphQA is the largest open-source chart VQA dataset with non-synthetic charts, 13 times larger than the previous ChartVQA.
- **Real-world charts**: The charts in the dataset come from actual academic literature, not synthetic data.
- **Rich context information**: Each question-answering dialogue is generated from text context related to the chart, including the paper title, abstract, and the paragraph that mentions the chart.
- **Multi-turn dialogue**: On average, there are 2.23 question-answer turns per chart, simulating real-world interaction.

### Evaluation and Improvement

- **Zero-shot evaluation**: The authors evaluated the zero-shot performance of multiple MLLMs on SciGraphQA and found that LLaVA-13B performed best, with a CIDEr score of 0.08.
- **Fine-tuning improvement**: Fine-tuning LLaVA-13B on SciGraphQA raised the CIDEr score to 0.26, significantly better than the model without fine-tuning.
- **Data table enhancement**: Using the DePlot model to extract data tables from charts and adding them to the prompts further improved zero-shot performance. For example, the CIDEr score of LLaVA-13B increased from 0.08 to 0.15.

In conclusion, this paper fills a gap in scientific chart understanding and question answering by constructing the SciGraphQA dataset, and it provides resources and methodological support for future research.
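The data table enhancement can be illustrated with a small Python sketch: a serialized table, assumed to be already extracted from the chart by a model such as DePlot, is prepended to the question before it is passed to the MLLM. The function name, prompt wording, and `max_chars` truncation limit are hypothetical; the summary does not state the exact prompt format the authors used.

```python
def add_table_to_question(question: str, table_text: str,
                          max_chars: int = 2000) -> str:
    """Prepend a serialized data table to a question (illustrative format).

    `table_text` is assumed to be plain text produced by a chart-to-table
    model; how it was produced is outside the scope of this sketch.
    """
    table_text = table_text.strip()
    if not table_text:
        # No table available: fall back to the plain question.
        return question
    if len(table_text) > max_chars:
        # Truncate very long tables so the prompt stays within budget.
        table_text = table_text[:max_chars].rstrip() + " ..."
    return (
        "Data table extracted from the graph:\n"
        f"{table_text}\n\n"
        f"Question: {question}"
    )
```

Falling back to the unmodified question when no table is available keeps the zero-shot baseline and the table-enriched setting comparable on the same inputs.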

SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs

S2M: Converting Single-Turn to Multi-Turn Datasets for Conversational Question Answering

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

GraphextQA: A Benchmark for Evaluating Graph-Enhanced Large Language Models

SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers

DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding

'Just because you are right, doesn't mean I am wrong': Overcoming a Bottleneck in the Development and Evaluation of Open-Ended Visual Question Answering (VQA) Tasks

Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

Interpretable medical image Visual Question Answering via multi-modal relationship graph learning

StoryQA : Story Grounded Question Answering Dataset

Understanding the Role of Scene Graphs in Visual Question Answering

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering

CR-LT-KGQA: A Knowledge Graph Question Answering Dataset Requiring Commonsense Reasoning and Long-Tail Knowledge

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic