Abstract: Multimodal pre-training with text, layout, and image has recently achieved SOTA performance on visually-rich document understanding tasks, which demonstrates the great potential of joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), with key-value pairs manually labeled for each language. Experimental results show that the LayoutXLM model significantly outperforms the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.

What problem does this paper attempt to address?

This paper addresses the language barrier in multilingual visually-rich document understanding. Existing cross-lingual pre-trained models perform poorly on visually-rich documents (such as tables and forms) because they rely mainly on text and ignore the documents' layout and image information. Moreover, most existing document understanding datasets and benchmarks are limited to English, which restricts research progress on non-English document understanding.

To address these problems, the authors propose **LayoutXLM**, a multimodal pre-trained model that aims to improve the understanding of multilingual visually-rich documents by jointly learning from text, layout, and image information. The authors also introduce **XFUND**, a multilingual form understanding benchmark dataset that contains manually annotated samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese) and is used to evaluate LayoutXLM.

### Main contributions:

1. **Propose LayoutXLM**: A multimodal pre-trained model for multilingual visually-rich document understanding, pre-trained on large-scale real-world scanned and digitized documents.
2. **Introduce the XFUND dataset**: A multilingual form understanding benchmark dataset that contains manually annotated form samples in 7 languages and is used to evaluate model performance.
3. **Experimental results**: LayoutXLM outperforms other state-of-the-art cross-lingual pre-trained models on the XFUND dataset, demonstrating the potential of multimodal pre-training for multilingual document understanding tasks.

Through this work, the authors hope to bridge the gap between languages and advance research on multilingual visually-rich document understanding.
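To make the idea of key-value annotation on a form more concrete, here is a minimal sketch of how one annotated form page might be represented, with each text segment carrying its text, a bounding box (the layout signal), and a label, and with links pairing keys to values. The class names (`Segment`, `FormSample`), the field names, the label strings, the helper `normalize_box`, the 0 to 1000 coordinate scale, and the sample values are all illustrative assumptions; this is not the actual XFUND schema or the LayoutXLM input pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Segment:
    """One text segment on a form page (hypothetical structure)."""
    seg_id: int
    text: str
    box: Box
    label: str  # assumed label set: "key", "value", or "other"


@dataclass
class FormSample:
    """A single annotated form page (hypothetical structure)."""
    lang: str
    width: int   # page image width in pixels
    height: int  # page image height in pixels
    segments: List[Segment] = field(default_factory=list)
    links: List[Tuple[int, int]] = field(default_factory=list)  # (key_id, value_id)

    def key_value_pairs(self) -> List[Tuple[str, str]]:
        """Resolve each (key_id, value_id) link into a (key text, value text) pair."""
        by_id = {s.seg_id: s for s in self.segments}
        return [(by_id[k].text, by_id[v].text) for k, v in self.links]


def normalize_box(box: Box, width: int, height: int, scale: int = 1000) -> Box:
    """Rescale pixel coordinates to a page-size-independent grid (assumed 0..scale)."""
    x0, y0, x1, y1 = box
    return (
        x0 * scale // width,
        y0 * scale // height,
        x1 * scale // width,
        y1 * scale // height,
    )


# Made-up German sample for illustration only.
sample = FormSample(
    lang="de",
    width=800,
    height=1000,
    segments=[
        Segment(0, "Name:", (80, 100, 200, 130), "key"),
        Segment(1, "Anna Schmidt", (220, 100, 480, 130), "value"),
    ],
    links=[(0, 1)],
)

print(sample.key_value_pairs())  # [('Name:', 'Anna Schmidt')]
print(normalize_box(sample.segments[0].box, sample.width, sample.height))  # (100, 100, 250, 130)
```

Keeping the bounding box next to the text is what lets a model use position as a signal: in this sketch, the key and its value share the same vertical extent, a cue that text alone would not expose.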

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

Enhancing Visually-Rich Document Understanding Via Layout Structure Modeling

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

DocLLM: A layout-aware generative language model for multimodal document understanding

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

LayoutReader: Pre-training of Text and Layout for Reading Order Detection

LAPDoc: Layout-Aware Prompting for Documents

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

LMDX: Language Model-based Document Information Extraction and Localization

ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Large Language Models Understand Layout