Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module, H-Reducer, which not only maintains the layout information but also reduces the length of visual features by merging horizontally adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set, DocStruct4M, to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset, DocReason25K, to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points on 5 of the 10 benchmarks. Our code, models, and datasets are publicly available at
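The idea of merging horizontally adjacent patches to shorten the visual sequence can be illustrated with a minimal sketch. It is not the authors' H-Reducer: the function names `merge_horizontal` and `flatten`, the merge width of 4, and the fixed uniform weights are all assumptions made for illustration, whereas a real module would presumably learn its convolution weights. The sketch only shows that a 1 x `width` convolution with stride `width` keeps the row layout while dividing the number of features per row by `width`.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # rows x columns x channels of patch features


def merge_horizontal(grid: Grid, width: int, weights: List[float]) -> Grid:
    """Merge each run of `width` horizontally adjacent patches into one feature.

    This acts like a 1 x `width` convolution with stride `width`. As a
    simplification (an assumption of this sketch), the kernel applies one
    scalar weight per patch position, shared across all channels.
    """
    if len(weights) != width:
        raise ValueError("need exactly one weight per merged patch")
    merged: Grid = []
    for row in grid:
        if len(row) % width != 0:
            raise ValueError("row length must be divisible by width")
        out_row: List[Vector] = []
        for start in range(0, len(row), width):
            window = row[start:start + width]
            channels = len(window[0])
            # Weighted sum over the window, computed independently per channel.
            out_row.append([
                sum(w * patch[c] for w, patch in zip(weights, window))
                for c in range(channels)
            ])
        merged.append(out_row)  # rows are never mixed, so vertical layout is kept
    return merged


def flatten(grid: Grid) -> List[Vector]:
    """Row-major flattening into the sequence that would be fed to the LLM."""
    return [patch for row in grid for patch in row]


if __name__ == "__main__":
    # Hypothetical 2 x 8 grid of 2-channel patch features: [row index, column index].
    grid = [[[float(r), float(c)] for c in range(8)] for r in range(2)]
    merged = merge_horizontal(grid, width=4, weights=[0.25] * 4)
    print(len(flatten(grid)), "->", len(flatten(merged)))
```

Merging only along the horizontal axis is a natural choice for text-rich images, since characters of a word or line usually sit side by side, so neighbouring patches in a row tend to carry related content.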
