mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, Fei Huang

2023-07-04

Abstract: Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal Large Language Models (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. In addition, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data, and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper targets OCR-free document understanding: existing MLLMs handle shallow text recognition but tend to miss fine-grained features such as complex tables or large blocks of text. Its main contributions are:

1. **A new Multimodal Large Language Model (MLLM)**: mPLUG-DocOwl is built on mPLUG-Owl and aims to strengthen document understanding without relying on Optical Character Recognition (OCR).
2. **Better handling of complex document features**: Existing models fall short on fine-grained OCR features such as complex tables or large blocks of text, which are crucial for OCR-free document understanding. mPLUG-DocOwl addresses this through in-domain training.
3. **An instruction-tuning dataset**: The dataset covers a wide range of visual-text understanding tasks to improve the model's OCR-free document understanding.
4. **A unified instruction-tuning strategy**: The model is trained jointly on language-only, general vision-and-language, and document instruction-tuning datasets.
5. **An OCR-free document instruction understanding evaluation set**: LLMDoc is used to compare models on instruction compliance and document understanding.
6. **Experimental results**: mPLUG-DocOwl achieves state-of-the-art performance on multiple commonly used OCR-free document understanding datasets and generalizes well to various downstream tasks without specific fine-tuning.

In short, the paper proposes a new multimodal model that improves the understanding of complex document structure without relying on OCR.
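
The joint training over language-only, general vision-and-language, and document data can be pictured with a small sketch. This is not the authors' implementation: the `Sample` type, the `Human:`/`AI:` prompt template, the `<image>` placeholder, the mixing weights, and the file name `invoice.png` are all assumptions made for illustration. It only shows the general idea of casting every source into one instruction format and drawing training examples from a weighted mix of the three sources.

```python
import random
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class Sample:
    instruction: str
    response: str
    image: Optional[str] = None  # hypothetical image path; None for language-only data


def to_unified(sample: Sample) -> Dict[str, Optional[str]]:
    """Cast any sample into one instruction format (assumed template)."""
    # Image-bearing samples get an "<image>" placeholder in front of the prompt.
    prefix = "<image>\n" if sample.image is not None else ""
    return {
        "image": sample.image,
        "prompt": f"{prefix}Human: {sample.instruction}\nAI: ",
        "target": sample.response,
    }


def mixed_stream(
    sources: Dict[str, List[Sample]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> Iterator[Dict[str, Optional[str]]]:
    """Yield n unified records, picking the source of each by its weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        record = to_unified(rng.choice(sources[name]))
        record["source"] = name
        yield record


# Toy data standing in for the three kinds of instruction data.
sources = {
    "language_only": [Sample("Summarize: cats sleep a lot.", "Cats sleep often.")],
    "general_vl": [Sample("What is in the picture?", "A dog.", image="dog.jpg")],
    "document": [Sample("What is the invoice total?", "42.00", image="invoice.png")],
}
weights = {"language_only": 1.0, "general_vl": 1.0, "document": 2.0}  # assumed ratios

batch = list(mixed_stream(sources, weights, n=8))
```

Each record in `batch` has the same fields regardless of its source, so a single training loop can consume all three kinds of data; the actual data ratios and prompt format used by mPLUG-DocOwl should be taken from the paper and its repository.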

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

Osprey: Pixel Understanding with Visual Instruction Tuning

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs