LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei
DOI: https://doi.org/10.48550/arXiv.2104.08836
2021-09-09
Abstract: Multimodal pre-training with text, layout, and image has recently achieved SOTA performance for visually-rich document understanding tasks, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), with key-value pairs manually labeled for each language. Experiment results show that the LayoutXLM model significantly outperforms the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the language barrier in multilingual visually-rich document understanding. Existing cross-lingual pre-trained models perform poorly on visually-rich documents (such as tables and forms) because they rely mainly on text and ignore the layout and image information of the documents. Moreover, most existing document understanding datasets and benchmarks are limited to English, which restricts research progress on non-English document understanding.

To solve these problems, the authors propose **LayoutXLM**, a multimodal pre-trained model that aims to improve the understanding of multilingual visually-rich documents by jointly learning from text, layout, and image information. The authors also introduce a multilingual form understanding benchmark dataset named **XFUND**, which contains manually annotated samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese) and is used to evaluate LayoutXLM.

### Main contributions:

1. **Propose LayoutXLM**: A multimodal pre-trained model for multilingual visually-rich document understanding, pre-trained on large-scale real-world scanned and digitized documents.
2. **Introduce the XFUND dataset**: A multilingual form understanding benchmark that contains manually annotated form samples in 7 languages and is used to evaluate model performance.
3. **Report experimental results**: LayoutXLM outperforms other state-of-the-art cross-lingual pre-trained models on the XFUND dataset, demonstrating the potential of multimodal pre-training for multilingual document understanding tasks.

Through this work, the authors hope to bridge the gap between languages and advance research on multilingual visually-rich document understanding.
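To make the idea of pairing text with layout more concrete, here is a minimal Python sketch of how a form sample with key-value annotations might be represented. It does not reflect the actual XFUND schema or the LayoutXLM input pipeline: the `FormEntity` class, the `"key"`/`"value"` labels, the `links` list, and the sample values are all hypothetical. Rescaling bounding boxes to a 0-1000 grid is a convention commonly used in LayoutLM-family implementations and is an assumption here, not something the summary states.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass
class FormEntity:
    """Hypothetical container for one text segment on a form page."""
    text: str
    box: Box      # bounding box in page pixels
    label: str    # illustrative label: "key" or "value"


def normalize_box(box: Box, page_width: int, page_height: int, scale: int = 1000) -> Box:
    """Rescale a pixel box to a page-size-independent integer grid (assumed 0-1000)."""
    x0, y0, x1, y1 = box
    return (
        scale * x0 // page_width,
        scale * y0 // page_height,
        scale * x1 // page_width,
        scale * y1 // page_height,
    )


# A made-up two-entity form fragment on an 800 x 1000 pixel page.
page_width, page_height = 800, 1000
entities: List[FormEntity] = [
    FormEntity("Name:", (100, 50, 200, 80), "key"),
    FormEntity("Ana Souza", (220, 50, 400, 80), "value"),
]

# Key-value pairs expressed as (key_index, value_index) into `entities`.
links: List[Tuple[int, int]] = [(0, 1)]

normalized = [normalize_box(e.box, page_width, page_height) for e in entities]

for key_idx, value_idx in links:
    print(entities[key_idx].text, "->", entities[value_idx].text, normalized[value_idx])
```

Keeping coordinates on a fixed grid means the same layout produces the same position values regardless of scan resolution, which is one reason this kind of normalization is popular when layout is fed to a model alongside text.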