Abstract: The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% for the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at https://multimodal-documents.github.io.
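The 4.6% figure in the abstract is a relative improvement, meaning it is measured as a fraction of the baseline score and not as a difference in score points. The sketch below shows the arithmetic. The function name and both scores are hypothetical values chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_improvement(baseline: float, tuned: float) -> float:
    """Return the improvement of `tuned` over `baseline` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (tuned - baseline) / baseline * 100.0


# Hypothetical correctness scores, NOT taken from the paper.
baseline_score = 4.000
tuned_score = 4.184

print(round(relative_improvement(baseline_score, tuned_score), 1))
```

With these illustrative numbers, the absolute gain is 0.184 score points, while the relative gain is about 4.6% of the baseline.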

What problem does this paper attempt to address?

The paper addresses the weakness of existing multimodal document understanding models on long, complex documents that mix several forms of content (such as text, charts, and tables). Specifically, it identifies three problems:

1. **Limitations of existing benchmark datasets**: Most existing benchmarks focus on shorter documents (usually fewer than 50 pages) and on relatively simple, mostly extractive questions. They do not fully reflect the challenges of real-world long-document understanding.
2. **Challenges in multimodal long-document understanding**: Real-world documents are often very long (hundreds of pages), complex, and multimodal (text, charts, tables, and so on). Existing multimodal models struggle with such documents and perform especially poorly on questions that depend on charts and tables.
3. **Evaluation difficulties of open-ended questions**: Long-document understanding often requires open-ended answers rather than simple extractive ones. Evaluating such answers is challenging and calls for a scalable, standardized method.

To address these problems, the paper proposes the following:

- **M-LongDoc benchmark dataset**: A benchmark of 851 samples designed to evaluate large multimodal models on long, diverse documents. The average document length exceeds 200 pages, and the documents cover fields such as academia, finance, and products.
- **Automated evaluation framework**: A framework based on multimodal models that evaluates the quality of answers to open-ended questions. It scores each answer with multiple evaluation models, providing a reliable and scalable evaluation method.
- **Retrieval-aware tuning method**: A new fine-tuning method aimed at making the model more robust and effective on long documents. It introduces distracting content during training so that the model learns to identify and use the relevant content, reducing the risk of being misled by irrelevant information.

Through these solutions, the paper aims to advance multimodal long-document understanding and make it more suitable for practical applications.
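The two mechanisms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Page` type, both function names, the random choice of distractors, and the use of a simple mean across judges are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass
class Page:
    page_id: int
    content: str    # text, or a placeholder standing in for a figure or table
    relevant: bool  # True if the page supports the answer


def build_retrieval_aware_sample(question, answer, pages, num_distractors, seed=0):
    """Mix relevant pages with randomly chosen irrelevant ones (hypothetical helper)."""
    rng = random.Random(seed)
    relevant = [p for p in pages if p.relevant]
    irrelevant = [p for p in pages if not p.relevant]
    # Assumption: distractors are sampled uniformly; a real retriever's
    # near-miss results could be used instead.
    distractors = rng.sample(irrelevant, min(num_distractors, len(irrelevant)))
    context = relevant + distractors
    rng.shuffle(context)  # avoid a positional shortcut to the relevant pages
    return {"question": question, "context": context, "answer": answer}


def aggregate_judge_scores(scores_by_judge):
    """Average the scores several judge models assign to one answer (assumed rule)."""
    if not scores_by_judge:
        raise ValueError("at least one judge score is required")
    return mean(scores_by_judge.values())


# Usage with made-up data: ten pages, only page 3 is relevant.
pages = [Page(i, f"page {i}", relevant=(i == 3)) for i in range(10)]
sample = build_retrieval_aware_sample("What does the chart show?", "An upward trend.", pages, 4)
print(len(sample["context"]))
print(aggregate_judge_scores({"judge_a": 4, "judge_b": 5, "judge_c": 3}))
```

Training on contexts built this way exposes the model to the same mix of useful and irrelevant pages that a retriever would return at inference time, which is the intuition behind the tuning method as summarized here.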

M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
