Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning

Jinxu Zhang

2024-08-14

Abstract: Understanding the contents of multimodal documents is essential to accurately extract relevant evidence and use it for reasoning. Existing document understanding models tend to generate answers with a single word or phrase directly, ignoring the source document's evidence and lacking interpretability. In this work, we address the lack of step-wise capabilities through data augmentation and extension. Specifically, we use Multi-modal Large Language Models (MLLMs), which have strong visual understanding and reasoning abilities, as data generators to generate step-wise question-and-answer pairs for document images and use a high-performance LLM as the error detector to filter out noisy data. This step-wise data generation pipeline is implemented using both template-based and few-shot methods. We then use the generated high-quality data to train a humanized document understanding and reasoning model, specifically designed to solve complex questions that require reasoning or multi-hop question answering, dubbed DocAssistant. Experimental results demonstrate the effectiveness and application value of step-wise generation, showing a 5% improvement on InfoVQA with complex layouts and a 7% improvement on ChartQA with complex reasoning, compared to directly generated answers. We hope our work highlights the potential of synthetic data and encourages further exploration of multi-modal document reasoning capabilities.

Information Retrieval, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address two main issues in multimodal document understanding and reasoning:

1. **Lack of step-by-step reasoning when generating answers**: Current document understanding models often output a single word or phrase directly, ignoring the evidence in the source document and the intermediate reasoning steps, so their answers lack interpretability.
2. **Insufficient handling of complex-layout documents and questions that require reasoning**: Existing models perform poorly on documents with complex layouts and on questions that require reasoning, especially in information extraction and logical reasoning.

To address these issues, the authors propose a data augmentation method that uses Multi-modal Large Language Models (MLLMs) as data generators, and they use the resulting high-quality data to train DocAssistant, an efficient multimodal document understanding and reasoning model. Experimental results show that step-wise generation improves performance over directly generated answers on InfoVQA (complex layouts) and ChartQA (complex reasoning).
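The generate-then-filter idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `StepwiseQA` record, the `build_dataset` function, and the two toy stand-ins (`toy_generate` for the MLLM data generator, `toy_is_clean` for the LLM error detector) are hypothetical names introduced here for illustration only, and the filtering rule is a deliberately simple placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class StepwiseQA:
    """One synthetic sample: a question, its reasoning steps, and the answer."""
    question: str
    steps: List[str]
    answer: str


# Hypothetical interfaces (assumptions, not the paper's API):
# a generator maps (image_path, prompt) to a sample,
# a detector returns True when the sample looks free of errors.
Generator = Callable[[str, str], StepwiseQA]
Detector = Callable[[StepwiseQA], bool]


def build_dataset(
    image_paths: Iterable[str],
    prompts: Iterable[str],
    generate: Generator,
    is_clean: Detector,
) -> List[StepwiseQA]:
    """Generate a sample per (image, prompt) pair and keep only clean ones."""
    prompts = list(prompts)  # reused for every image
    kept: List[StepwiseQA] = []
    for path in image_paths:
        for prompt in prompts:
            sample = generate(path, prompt)
            if is_clean(sample):
                kept.append(sample)
    return kept


def toy_generate(image_path: str, prompt: str) -> StepwiseQA:
    """Toy stand-in for an MLLM call; returns a noisy sample for 'bad' images."""
    if "bad" in image_path:
        return StepwiseQA("What is the total?", [], "")
    steps = ["Row A shows 3.", "Row B shows 4.", "3 + 4 = 7."]
    return StepwiseQA("What is the total?", steps, "7")


def toy_is_clean(sample: StepwiseQA) -> bool:
    """Toy stand-in for an LLM error detector: the final step must support the answer."""
    return bool(sample.steps) and bool(sample.answer) and sample.answer in sample.steps[-1]


dataset = build_dataset(
    ["doc_good.png", "doc_bad.png"],
    ["template-style prompt"],
    toy_generate,
    toy_is_clean,
)
```

In this toy run only the sample from `doc_good.png` survives the filter. In a real pipeline the two callables would wrap model calls, and the prompts would come from the template-based or few-shot methods the abstract mentions.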

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism

Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

Understanding Information Storage and Transfer in Multi-modal Large Language Models

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

DocLLM: A layout-aware generative language model for multimodal document understanding

Struct-X: Enhancing Large Language Models Reasoning with Structured Data

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Efficient Multimodal Large Language Models: A Survey

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation