Abstract: Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at https://aka.ms/layoutlm.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses a central issue in document image understanding: how to jointly model text and layout information when processing scanned documents. Existing pre-trained models operate almost entirely at the text level and neglect the layout and style information that document image understanding depends on. To overcome this limitation, the authors propose LayoutLM, a pre-training method that models text and layout information within a single framework.

### Main Contributions

1. **Joint Pre-training of Text and Layout Information in a Single Framework for the First Time**: By introducing 2-D position embeddings and image embeddings, LayoutLM captures the relative positions and visual features of words in a document, and so represents the document's structure as well as its content.
2. **Proposing Multi-task Learning Objectives**: The Masked Visual-Language Model (MVLM) loss and the Multi-label Document Classification (MDC) loss together drive the joint pre-training of text and layout.
3. **Outperforming Existing Pre-trained Models**: LayoutLM sets new state-of-the-art results on several benchmark datasets, with the largest gain on form understanding and a further gain on receipt information extraction.

### Specific Issues

- **Joint Modeling of Text and Layout Information**: Existing pre-trained models mainly use text information and neglect layout information, which limits their performance on document image understanding tasks.
- **Large-scale Self-supervised Pre-training**: Most methods rely on a small number of manually labeled training samples and do not exploit large-scale unlabeled data.
- **Fusion of Multi-modal Information**: Existing methods typically reuse pre-trained computer vision models or natural language processing models without jointly training on text and layout information.
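The 2-D position embeddings named in the first contribution can be illustrated with a minimal pure-Python sketch. It assumes each word comes with a pixel bounding box `(x0, y0, x1, y1)`, that coordinates are scaled to an integer grid of 0 to 1000, and that one lookup table serves both x coordinates and another serves both y coordinates. The function names, the toy width `DIM`, and the random (untrained) tables are hypothetical, and the sketch leaves out the other embeddings a real model would add.

```python
import random

GRID = 1000  # assumed coordinate range after normalization
DIM = 8      # toy embedding width, chosen only for illustration


def normalize_box(box, page_width, page_height, grid=GRID):
    """Scale a pixel box (x0, y0, x1, y1) to integers in [0, grid]."""
    x0, y0, x1, y1 = box
    return (
        int(grid * x0 / page_width),
        int(grid * y0 / page_height),
        int(grid * x1 / page_width),
        int(grid * y1 / page_height),
    )


def make_table(rows, dim, rng):
    """Build a random lookup table standing in for learned embeddings."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def embed_token(word_vec, box, table_x, table_y):
    """Add the four coordinate embeddings to the word embedding.

    x0 and x1 share table_x; y0 and y1 share table_y.
    """
    x0, y0, x1, y1 = box
    parts = [word_vec, table_x[x0], table_y[y0], table_x[x1], table_y[y1]]
    return [sum(values) for values in zip(*parts)]


rng = random.Random(0)
table_x = make_table(GRID + 1, DIM, rng)
table_y = make_table(GRID + 1, DIM, rng)

# A word occupying pixels (120, 50)-(300, 80) on a 1200 x 1600 page.
box = normalize_box((120, 50, 300, 80), page_width=1200, page_height=1600)
vector = embed_token([0.0] * DIM, box, table_x, table_y)
```

Because the word and its coordinates are summed into one vector, two occurrences of the same word at different places on the page receive different input representations, which is what lets the model use layout.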
### Solutions

- **2-D Position Embeddings**: Represent the relative spatial positions of words in a document, giving the model access to the document's layout.
- **Image Embeddings**: Generated by a Faster R-CNN model to capture visual features such as font orientation, type, and color.
- **Multi-task Learning**: Combines the MVLM and MDC loss functions to pre-train text and layout information jointly.

### Experimental Results

- **Form Understanding**: On the FUNSD dataset, LayoutLM raises the state-of-the-art F1 score from 70.72 to 79.27.
- **Receipt Understanding**: On the SROIE dataset, LayoutLM raises the state-of-the-art F1 score from 94.02 to 95.24.
- **Document Image Classification**: On the RVL-CDIP dataset, LayoutLM raises the state-of-the-art classification accuracy from 93.07 to 94.42.

These gains across three document image understanding tasks indicate that joint text and layout pre-training is useful in practical applications.
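The MVLM objective mentioned above can be sketched as a masking step that hides some tokens while leaving every bounding box untouched, so that the model has to predict the hidden word from its context and its position. This is a minimal illustration, not the authors' code: the function name, the `[MASK]` string, and the 15% default rate are assumptions borrowed from BERT-style masking, and the sketch always substitutes `[MASK]` rather than sometimes using a random or unchanged token.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed BERT-style mask token


def mask_for_mvlm(tokens, boxes, mask_prob=0.15, seed=0):
    """Mask a random subset of tokens and keep all bounding boxes.

    Returns (masked_tokens, boxes, labels). labels[i] holds the original
    token at masked positions and None elsewhere.
    """
    if len(tokens) != len(boxes):
        raise ValueError("each token needs exactly one bounding box")
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the text ...
            labels.append(token)       # ... and remember what to predict
        else:
            masked.append(token)
            labels.append(None)
    # The layout signal is returned unchanged for every position.
    return masked, list(boxes), labels


tokens = ["Invoice", "No", ":", "12345"]
boxes = [(100, 31, 250, 50), (260, 31, 300, 50), (302, 31, 310, 50), (320, 31, 420, 50)]
masked_tokens, kept_boxes, labels = mask_for_mvlm(tokens, boxes)
```

A training loop would compute its loss only at positions where `labels` is not `None`.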

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

LayoutReader: Pre-training of Text and Layout for Reading Order Detection

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Enhancing Visually-Rich Document Understanding Via Layout Structure Modeling

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

LAPDoc: Layout-Aware Prompting for Documents

DocLLM: A layout-aware generative language model for multimodal document understanding

VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Image Layer Modeling for Complex Document Layout Generation

A Fusion Framework of Whitespace Smear Cutting and Swin Transformer for Document Layout Analysis