LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
DOI: https://doi.org/10.1145/3394486.3403172
2020-06-16
Abstract: Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at https://aka.ms/layoutlm.
Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses a central issue in document image understanding: how to jointly model text and layout information when processing scanned documents. Existing pre-trained models operate almost entirely at the text level and neglect the layout and style information that document image understanding depends on. To overcome this limitation, the authors propose LayoutLM, a pre-training method that models text and layout information within a single framework.

### Main Contributions

1. **Joint Pre-training of Text and Layout Information in a Single Framework for the First Time**: By introducing 2-D position embeddings and image embeddings, LayoutLM captures the relative positions and visual features of words in a document, and so represents the document's structure as well as its content.
2. **Multi-task Learning Objectives**: The paper proposes two pre-training losses, the Masked Visual-Language Model (MVLM) and Multi-label Document Classification (MDC), which strengthen the joint pre-training of text and layout.
3. **Outperforming Existing Pre-trained Models**: LayoutLM outperforms existing pre-trained models on multiple benchmark datasets, most notably in form understanding and receipt information extraction.

### Specific Issues

- **Joint Modeling of Text and Layout Information**: Existing pre-trained models mainly focus on text information and neglect layout information, which limits their performance in document image understanding tasks.
- **Large-scale Self-supervised Pre-training**: Most methods rely on a small number of manually labeled training samples and do not exploit large-scale unlabeled data.
- **Fusion of Multi-modal Information**: Existing methods typically use pre-trained computer vision models or natural language processing models, without considering the joint training of text and layout information.
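To make the 2-D position embeddings named under Main Contributions more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the 0 to 1000 coordinate scale, the sharing of one table between `x0`/`x1` and another between `y0`/`y1`, the toy sizes, and the helper names `normalize_box`, `make_table`, and `embed_token` are all assumptions introduced here. The sketch also omits the 1-D sequence position and image embeddings.

```python
import random

SCALE = 1000  # assumed coordinate range; each table needs SCALE + 1 rows


def normalize_box(box, page_width, page_height, scale=SCALE):
    """Map a pixel-space box (x0, y0, x1, y1) to integer coordinates in [0, scale]."""
    x0, y0, x1, y1 = box
    return (
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    )


def make_table(rows, dim, rng):
    """Randomly initialised embedding table (a stand-in for learned weights)."""
    return [[rng.uniform(-0.02, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed_token(token_id, box, word_table, x_table, y_table):
    """Sum the word embedding with the four 2-D position embeddings."""
    x0, y0, x1, y1 = box
    parts = [word_table[token_id], x_table[x0], y_table[y0], x_table[x1], y_table[y1]]
    return [sum(values) for values in zip(*parts)]


rng = random.Random(0)
dim = 8                                     # toy hidden size
word_table = make_table(100, dim, rng)      # toy vocabulary of 100 token ids
x_table = make_table(SCALE + 1, dim, rng)   # looked up for x0 and x1
y_table = make_table(SCALE + 1, dim, rng)   # looked up for y0 and y1

# A word whose box spans pixels (100, 50) to (300, 100) on a 1000 x 500 page.
box = normalize_box((100, 50, 300, 100), page_width=1000, page_height=500)
vector = embed_token(7, box, word_table, x_table, y_table)
print(box, len(vector))
```

Because the position vectors are simply added to the word vector, the same word at two different places on the page yields two different inputs, which is what lets the model use layout as a signal.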
### Solutions

- **2-D Position Embeddings**: Represent the relative spatial positions of words in a document, helping the model understand the document's layout.
- **Image Embeddings**: Generated by a Faster R-CNN model to capture visual features such as font orientation, type, and color.
- **Multi-task Learning**: Combines the MVLM and MDC loss functions to jointly pre-train on text and layout information.

### Experimental Results

- **Form Understanding**: On the FUNSD dataset, LayoutLM raises the F1 score from the previous best of 70.72 to 79.27.
- **Receipt Understanding**: On the SROIE dataset, LayoutLM raises the F1 score from the previous best of 94.02 to 95.24.
- **Document Image Classification**: On the RVL-CDIP dataset, LayoutLM raises classification accuracy from the previous best of 93.07 to 94.42.

Together, these results show that jointly pre-training on text and layout improves document image understanding, with the largest gain (8.55 F1 points) on form understanding.
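As a rough illustration of the MVLM objective listed under Solutions, the sketch below masks text tokens while leaving their bounding boxes untouched, so a model would have to predict each hidden word from the surrounding text and from where the word sits on the page. This is a simplified sketch, not the authors' code: the 15% masking rate is a BERT-style assumption that the summary above does not state, every selected token is replaced by `[MASK]` for simplicity, and the helper name `mask_for_mvlm` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_for_mvlm(tokens, boxes, mask_prob=0.15, seed=0):
    """Mask text tokens at random while keeping every bounding box.

    Returns (masked_tokens, boxes, labels). labels[i] holds the original
    token where position i was masked, and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the text ...
            labels.append(token)  # ... and remember it as the prediction target
        else:
            masked.append(token)
            labels.append(None)
    # Layout is passed through unchanged: only the text is hidden.
    return masked, list(boxes), labels


tokens = ["Date", ":", "2020-06-16"]
boxes = [(100, 100, 180, 120), (185, 100, 190, 120), (200, 100, 320, 120)]
masked, kept_boxes, labels = mask_for_mvlm(tokens, boxes, mask_prob=0.5)
print(masked, kept_boxes, labels)
```

A training loop would compute the prediction loss only at positions where `labels` is not `None`.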