Abstract: Modeling and leveraging layout reading order in visually-rich documents (VrDs) is critical in document intelligence, as it captures the rich structural semantics within documents. Previous works typically formulate layout reading order as a permutation of layout elements, i.e., a sequence containing all the layout elements. However, we argue that this formulation does not adequately convey the complete reading order information in the layout, which may lead to performance decline in downstream VrD tasks. To address this issue, we propose to model the layout reading order as ordering relations over the set of layout elements, which are expressive enough to capture the complete reading order information. To enable empirical evaluation of methods for this improved form of reading order prediction (ROP), we establish a comprehensive benchmark dataset that annotates reading order as relations over layout elements, together with a relation-extraction-based method that outperforms previous methods. Moreover, to highlight the practical benefits of the improved form of layout reading order, we propose a reading-order-relation-enhancing pipeline that improves model performance on arbitrary VrD tasks by introducing additional reading order relation inputs. Comprehensive results demonstrate that the pipeline generally benefits downstream VrD tasks: (1) by utilizing the reading order relation information, the enhanced downstream models achieve SOTA results on both task settings of the targeted dataset; (2) by utilizing the pseudo reading order information generated by the proposed ROP model, performance improves across all three models and eight cross-domain VrD-IE/QA task settings without targeted optimization.
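The contrast between the two formulations can be illustrated with a minimal sketch. It is not the paper's implementation: the element names, the helper functions `permutation_to_relations` and `is_single_chain`, and the choice of immediate-succession pairs as the relation are all assumptions made for illustration. The sketch shows that every permutation can be written as a set of ordering relations, while a relation set in which one element has several successors (here a hypothetical table header followed by two cells) cannot be reduced to a single sequence.

```python
def permutation_to_relations(order):
    """Immediate-succession pairs implied by a permutation of element ids."""
    return set(zip(order, order[1:]))


def is_single_chain(elements, relations):
    """Return True if `relations` links all `elements` into one linear sequence."""
    successors = {e: set() for e in elements}
    predecessors = {e: set() for e in elements}
    for a, b in relations:
        successors[a].add(b)
        predecessors[b].add(a)
    # A chain allows at most one successor and one predecessor per element.
    if any(len(successors[e]) > 1 or len(predecessors[e]) > 1 for e in elements):
        return False
    heads = [e for e in elements if not predecessors[e]]
    if len(heads) != 1:
        return False
    # Walk from the unique head and check that every element is reached.
    visited, current = 1, heads[0]
    while successors[current]:
        current = next(iter(successors[current]))
        visited += 1
    return visited == len(elements)


# Hypothetical layout elements (illustrative names only).
elements = ["title", "header", "cell_a", "cell_b"]

# Permutation view: one fixed sequence over all elements.
as_permutation = permutation_to_relations(elements)

# Relation view: the header is followed by both cells, with no order between them.
as_relations = {("title", "header"), ("header", "cell_a"), ("header", "cell_b")}

print(is_single_chain(elements, as_permutation))  # True
print(is_single_chain(elements, as_relations))    # False
```

Under these assumptions, the permutation view has to pick an arbitrary order between `cell_a` and `cell_b`, whereas the relation view records only the orderings that the layout actually implies.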

Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding
