Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning

Jinxu Zhang
2024-08-14
Abstract: Understanding the contents of multimodal documents is essential to accurately extract relevant evidence and use it for reasoning. Existing document understanding models tend to generate answers as a single word or phrase directly, ignoring the source document's evidence and lacking interpretability. In this work, we address the lack of step-wise capabilities through data augmentation and extension. Specifically, we use Multi-modal Large Language Models (MLLMs), which have strong visual understanding and reasoning abilities, as data generators to produce step-wise question-and-answer pairs for document images, and we use a high-performance LLM as an error detector to filter out noisy data. This step-wise data generation pipeline is implemented using both template-based and few-shot methods. We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant, specifically designed to solve complex questions that require reasoning or multi-hop question answering. Experimental results demonstrate the effectiveness and application value of step-wise generation, showing a 5% improvement on InfoVQA with complex layouts and a 7% improvement on ChartQA with complex reasoning, compared to directly generated answers. We hope our work highlights the potential of synthetic data and encourages further exploration of multi-modal document reasoning capabilities.
Information Retrieval, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address two main issues in multimodal document understanding and reasoning:

1. **Lack of step-by-step reasoning when generating answers**: Current document understanding models often output a single word or phrase directly, ignoring the evidence in the source document and the intermediate reasoning steps, so their answers lack interpretability.
2. **Insufficient handling of complex layouts and reasoning-heavy questions**: Existing models perform poorly on documents with complex layouts and on questions that require reasoning, especially in information extraction and logical reasoning.

To address these issues, the authors propose a data augmentation method in which multimodal large language models (MLLMs) generate step-wise question-and-answer pairs and an LLM-based error detector filters out noisy samples. They use the resulting high-quality data to train DocAssistant, a multimodal document understanding and reasoning model. Experimental results show that step-wise generation improves over directly generated answers on InfoVQA (complex layouts) and ChartQA (complex reasoning).
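
The generate-then-filter idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `StepwiseQA`, `generate_and_filter`, `toy_generator`, and `toy_detector` are hypothetical, and the two toy functions are simple stand-ins for the MLLM data generator and the LLM error detector, whose actual prompts and checks are not given here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StepwiseQA:
    """One synthetic sample: a question, its reasoning steps, and the final answer."""
    question: str
    steps: List[str]
    answer: str


# Hypothetical interfaces: a generator maps (image path, question) to a sample,
# and a detector returns True when the sample looks clean enough to keep.
Generator = Callable[[str, str], StepwiseQA]
Detector = Callable[[StepwiseQA], bool]


def generate_and_filter(
    image_path: str,
    questions: Iterable[str],
    generator: Generator,
    detector: Detector,
) -> List[StepwiseQA]:
    """Generate one step-wise sample per question and keep only those the detector accepts."""
    kept = []
    for question in questions:
        sample = generator(image_path, question)
        if detector(sample):
            kept.append(sample)
    return kept


def toy_generator(image_path: str, question: str) -> StepwiseQA:
    """Stand-in for an MLLM call; returns canned samples, one of them deliberately noisy."""
    if "total" in question:
        return StepwiseQA(question, ["Row A shows 3.", "Row B shows 4.", "3 + 4 = 7."], "7")
    # Noisy sample: the final answer does not follow from the last step.
    return StepwiseQA(question, ["The chart shows 10."], "12")


def toy_detector(sample: StepwiseQA) -> bool:
    """Stand-in for an LLM error detector: require the answer to appear in the last step."""
    return bool(sample.steps) and sample.answer in sample.steps[-1]


kept = generate_and_filter(
    "doc.png",  # placeholder path; the toy generator never opens it
    ["What is the total?", "What is the peak value?"],
    toy_generator,
    toy_detector,
)
for sample in kept:
    print(sample.question, "->", sample.answer)
```

Passing the generator and detector as plain callables keeps the filtering loop independent of any particular model or API, so either stand-in could be swapped for a real model call without changing `generate_and_filter`.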