M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan

2024-11-07

Abstract: Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses a gap in existing benchmarks for evaluating foundation models: they focus mainly on single-document, text-only tasks and so fail to capture the complexity of the scientific research workflow, which usually involves interpreting non-textual data and gathering information across multiple documents. To fill this gap, the paper introduces M3SciQA, a multi-modal, multi-document scientific question-answering benchmark intended to evaluate foundation models more comprehensively. M3SciQA contains 1,452 expert-annotated questions covering 70 natural language processing (NLP) paper clusters, where each cluster consists of a primary paper and all of its cited documents. The questions mirror the workflow of understanding a single paper and require models to process multi-modal, multi-document data. Using M3SciQA, the authors evaluate 18 foundation models. The results show that current foundation models still lag significantly behind human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. The paper also explores the implications of these findings for the future application of foundation models to multi-modal scientific literature analysis.

SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

ScienceQA: a novel resource for question answering on scholarly articles

Benchmarking Foundation Models with Language-Model-as-an-Examiner

DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

ClimaQA: An Automated Evaluation Framework for Climate Foundation Models

DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation

SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering

SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines