Abstract: Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to automatically identify such groupings, we argue that current VrDU approaches are unrealistic. We thus introduce a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), that does not allow for using manually annotated semantic groups. We also propose a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping through arranging words and bringing the representations of words that potentially belong to the same semantic group closer together. Our experimental results demonstrate that the performance of existing methods deteriorates on the ReVrDU task, while ReLayout shows superior performance.

What problem does this paper attempt to address?

This paper addresses how to perform visually-rich document understanding (VrDU) without relying on manually annotated semantic groups. It points out that current VrDU methods depend on such annotations to capture the semantic structure of documents, which is unrealistic in practice because automated tools such as OCR cannot identify these groups. The paper therefore proposes a new VrDU variant, real-world visually-rich document understanding (ReVrDU), and a new method named ReLayout. ReLayout uses layout-enhanced pre-training to learn semantic grouping by arranging words and bringing the representations of words that potentially belong to the same semantic group closer together, which makes it suitable for the ReVrDU scenario.

The main contributions of the paper are as follows:

1. **Proposing the ReVrDU task**: Unlike existing VrDU tasks, ReVrDU does not allow manually annotated semantic groups. Only information provided by OCR tools may be used, such as words, global 1D positions, word-level 2D bounding boxes, and text paragraphs, which brings the task closer to practical application scenarios.
2. **Designing the ReLayout model**: ReLayout strengthens its understanding of local layout structures and relationships through two strategies, 1D local order prediction (1-LOP) and 2D text-paragraph clustering (2-TSC), and learns potential semantic group information in a self-supervised manner.
3. **Experimental verification**: Experiments on multiple datasets show that ReLayout outperforms existing methods on downstream tasks in both ideal and real-world scenarios, and is notably more robust when using information extracted by different OCR tools.

In conclusion, this paper aims to make VrDU research more practical and automated, reduce the dependence on manually annotated data, and improve the performance and reliability of models in practical applications.
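
The OCR-only inputs and the grouping intuition summarized above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify the exact 1-LOP or 2-TSC objectives, so the `OcrWord` fields, `local_order_labels`, and `mean_same_segment_distance` are hypothetical names chosen for illustration. The sketch assumes that each OCR word carries a segment identifier, that a local-order target is the word's index within its own segment, and that "closer together" can be measured as the average Euclidean distance between embeddings of words in the same segment.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Sequence, Tuple


@dataclass
class OcrWord:
    """One word as an OCR tool might return it (hypothetical schema)."""
    text: str
    global_pos: int                  # global 1D position in OCR reading order
    box: Tuple[int, int, int, int]   # word-level 2D box: (x0, y0, x1, y1)
    segment_id: int                  # OCR text segment/paragraph the word belongs to


def local_order_labels(words: Sequence[OcrWord]) -> List[int]:
    """Return, for each word, its 0-based index inside its own segment.

    Words are ranked by global_pos within each segment. This is an assumed
    stand-in for a local-order prediction target, not the paper's definition.
    """
    next_index: Dict[int, int] = {}
    label_by_pos: Dict[int, int] = {}
    for word in sorted(words, key=lambda w: w.global_pos):
        idx = next_index.get(word.segment_id, 0)
        label_by_pos[word.global_pos] = idx
        next_index[word.segment_id] = idx + 1
    return [label_by_pos[w.global_pos] for w in words]


def mean_same_segment_distance(
    words: Sequence[OcrWord], embeddings: Sequence[Sequence[float]]
) -> float:
    """Average Euclidean distance between embeddings of same-segment word pairs.

    A training objective that pulls grouped words together would drive this
    value down; here it is only a measurement, not a loss from the paper.
    """
    total, pairs = 0.0, 0
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if words[i].segment_id == words[j].segment_id:
                total += sqrt(
                    sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
                )
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy document: two segments of two words each (made-up values).
words = [
    OcrWord("Invoice", 0, (10, 10, 60, 20), segment_id=0),
    OcrWord("No.", 1, (65, 10, 85, 20), segment_id=0),
    OcrWord("Total", 2, (10, 40, 45, 50), segment_id=1),
    OcrWord("Due", 3, (50, 40, 75, 50), segment_id=1),
]
embeddings = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (1.0, 1.0)]

labels = local_order_labels(words)
spread = mean_same_segment_distance(words, embeddings)
```

In this toy input, `labels` restarts at 0 for each segment, and `spread` averages one distant pair and one identical pair. A real model would compute embeddings with a neural encoder and optimize a differentiable loss instead of measuring a fixed quantity.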

ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training