M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan
2024-11-07
Abstract: Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a limitation of current benchmarks for evaluating foundation models: they focus mainly on single-document, text-only tasks and so fail to capture the complexity of scientific research workflows, which usually involve interpreting non-textual data and gathering information across multiple documents. To fill this gap, the paper introduces M3SciQA, a multi-modal, multi-document scientific question-answering benchmark intended to evaluate foundation models more comprehensively. M3SciQA contains 1,452 expert-annotated questions covering 70 natural language processing (NLP) paper clusters, each consisting of a primary paper and all of its cited documents. The questions mirror the workflow of understanding a single paper and require models to process multi-modal, multi-document data. Using M3SciQA, the paper evaluates 18 foundation models. The results show that current foundation models still lag significantly behind human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. The paper also explores the implications of these findings for future applications of foundation models in multi-modal scientific literature analysis.