Abstract: Structure information is critical for understanding the semantics of text-rich images such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding can recognize text but lack general structure understanding for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs. Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module, H-Reducer, which both preserves layout information and shortens the visual feature sequence by merging horizontally adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set, DocStruct4M, to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset, DocReason25K, to elicit detailed explanation ability in the document domain. Our model, DocOwl 1.5, achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points on 5 of the 10 benchmarks. Our code, models, and datasets are publicly available at
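
The horizontal merging described for H-Reducer can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: the function name `merge_horizontal`, the merge width of 4, the feature dimension, and the averaging weights are all assumptions chosen for illustration. The sketch treats a 1 x `width` convolution with stride `width` as a linear map applied to each group of `width` horizontally adjacent patch features, so each row keeps its position in the grid while the number of visual tokens drops by a factor of `width`.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # rows x cols x feature_dim


def merge_horizontal(grid: Grid, weights: List[List[float]], width: int) -> Grid:
    """Merge each run of `width` horizontally adjacent patches into one feature.

    Equivalent to a 1 x width convolution with stride width: the features of
    the patches in a window are concatenated and multiplied by `weights`,
    whose shape is (out_dim, width * in_dim). Rows are never mixed, so the
    vertical layout of the patch grid is preserved.
    """
    merged: Grid = []
    for row in grid:
        if len(row) % width != 0:
            raise ValueError("row length must be divisible by the merge width")
        new_row: List[Vector] = []
        for start in range(0, len(row), width):
            # Concatenate the features of the patches inside this window.
            window = [x for patch in row[start:start + width] for x in patch]
            # One output feature per weight row (one convolution filter each).
            new_row.append([sum(w * x for w, x in zip(w_row, window))
                            for w_row in weights])
        merged.append(new_row)
    return merged


# Hypothetical toy input: a 2 x 8 patch grid with 2-dimensional features,
# where each patch stores [column index, row index].
WIDTH, DIM = 4, 2
grid = [[[float(c), float(r)] for c in range(8)] for r in range(2)]

# Assumed weights: average each feature channel over the window. A trained
# module would learn these values instead.
weights = [[1.0 / WIDTH if j % DIM == i else 0.0 for j in range(WIDTH * DIM)]
           for i in range(DIM)]

merged = merge_horizontal(grid, weights, WIDTH)
tokens_before = sum(len(row) for row in grid)
tokens_after = sum(len(row) for row in merged)
print(tokens_before, tokens_after)
```

In this toy setup the 16 input patches become 4 merged features, and each merged feature still belongs to its original row, which is the sense in which layout is kept while the sequence passed to the LLM gets shorter.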

DocParser: Hierarchical Structure Parsing of Document Renderings

DSG: An End-to-End Document Structure Generator

DocHieNet: A Large and Diverse Dataset for Document Hierarchy Parsing

HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures

DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding

Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets

Interpretable Structure-aware Document Encoders with Hierarchical Attention

Hierarchical Human Parsing with Typed Part-Relation Reasoning

DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents

Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Structure extraction from PDF-based book documents.

PP-StructureV2: A Stronger Document Analysis System

Document Structure in Long Document Transformers

Deep Hierarchical Human Semantic Parsing

WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data

READoc: A Unified Benchmark for Realistic Document Structured Extraction

Hierarchical Logical Structure Extraction of Book Documents by Analyzing Tables of Contents

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations