M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing
2024-11-09
Abstract: The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% for the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the limited ability of existing multimodal document understanding models to handle long, complex documents that mix several forms of content (such as text, charts, and tables). Specifically, the paper points out:

1. **Limitations of existing benchmark datasets**: Most existing benchmarks focus on shorter documents (usually fewer than 50 pages), and their questions are relatively simple and mainly extractive. They fail to reflect the challenges of long-document understanding in the real world.
2. **Challenges in multimodal long-document understanding**: Real-world documents are often very long (hundreds of pages), complex, and multimodal (text, charts, tables, etc.). Existing multimodal models struggle with such documents and perform especially poorly when questions depend on charts and tables.
3. **Evaluation difficulties of open-ended questions**: Long-document understanding often requires generating open-ended answers rather than simple extractive ones, so evaluating these answers is challenging and calls for a scalable, standardized method.

To address these problems, the paper proposes the following solutions:

- **M-LongDoc benchmark dataset**: A benchmark of 851 samples designed to evaluate large multimodal models on long, diverse documents. Documents in the dataset average more than 200 pages and cover multiple domains, including academia, finance, and products.
- **Automated evaluation framework**: A framework that uses multimodal models to judge the quality of answers to open-ended questions. It combines scores from multiple evaluator models, providing a reliable and scalable evaluation method.
- **Retrieval-aware tuning method**: A fine-tuning method aimed at making the model more robust and effective on long documents. It introduces distracting content during training so that the model learns to identify and use the relevant content, reducing the risk of being misled by irrelevant information.

Through these solutions, the paper aims to advance multimodal long-document understanding and make it more suitable for practical applications.
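The idea of introducing distracting content during training can be illustrated with a minimal sketch. It is not the authors' implementation: the `Page` structure, the `build_training_context` function, and the parameter `k` are hypothetical names chosen for illustration, and sampling distractors uniformly at random is an assumption (a real pipeline might instead take them from a retriever's top-ranked but irrelevant results).

```python
import random
from dataclasses import dataclass


@dataclass
class Page:
    """One retrievable unit of a document (hypothetical structure)."""
    page_id: int
    modality: str  # e.g. "text", "figure", or "table"
    content: str


def build_training_context(gold, candidates, k=5, seed=0):
    """Mix the relevant (gold) pages with distractor pages, up to k pages total.

    Assumption: distractors are sampled uniformly from the non-gold pages.
    """
    rng = random.Random(seed)
    gold_ids = {p.page_id for p in gold}
    # Any candidate page that is not gold may serve as a distractor.
    distractor_pool = [p for p in candidates if p.page_id not in gold_ids]
    n_distractors = max(0, min(k - len(gold), len(distractor_pool)))
    context = list(gold) + rng.sample(distractor_pool, n_distractors)
    # Shuffle so the model cannot rely on the position of the relevant page.
    rng.shuffle(context)
    return context


if __name__ == "__main__":
    pages = [Page(i, "text", f"page {i}") for i in range(10)]
    context = build_training_context(gold=[pages[3]], candidates=pages, k=5)
    print([p.page_id for p in context])
```

Each training example would then pair the question with this mixed context and the reference answer, so that the tuned model sees both relevant and irrelevant pages, as it would in a retrieval setting.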