LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

Te-Lin Wu, Cheng Li, Mingyang Zhang, Tao Chen, Spurthi Amba Hombaiah, Michael Bendersky
DOI: https://doi.org/10.48550/arXiv.2104.08405
2021-04-17
Abstract: Document layout comprises both structural and visual (e.g., font sizes) information that is vital but often ignored by machine learning models. The few existing models which do use layout information only consider textual contents, and overlook the existence of contents in other modalities such as images. Additionally, spatial interactions of presented contents in a layout were never really fully exploited. To bridge this gap, we parse a document into content blocks (e.g., text, table, image) and propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document. Our LAMPreT encodes each block with a multimodal transformer in the lower-level and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pretraining objectives where the lower-level model is trained similarly to multimodal grounding models, and the higher-level model is trained with our proposed novel layout-aware objectives. We evaluate the proposed model on two layout-aware tasks -- text block filling and image suggestion -- and show the effectiveness of our proposed hierarchical architecture as well as pretraining techniques.
Computation and Language, Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
The paper addresses a gap in document understanding: existing machine-learning models often ignore the structural and visual information carried by a document's layout, and the few that do use layout consider only text while overlooking other modalities such as images. Existing models also make little use of the spatial interactions among the contents of a layout. To address this, the paper proposes LAMPreT (Layout-Aware Multimodal PreTraining), a layout-aware multimodal hierarchical framework that models both the individual content blocks and the whole document in order to learn more comprehensive multimodal document representations. Specifically, the framework works as follows:

1. **Content Block Parsing**: The document is parsed into content blocks, each of which can be text, a table, or an image, and the position, type, and attributes of each block are extracted.
2. **Hierarchical Structure Modeling**: A two-level cascaded Transformer is used. The lower-level model encodes each content block, and the higher-level model, a specially designed Transformer, aggregates the block-level representations and their connections.
3. **Hierarchical Pre-training Objectives**: The lower-level model is trained with the standard masked language modeling (MLM) loss and an image-text matching loss; the higher-level model is trained with three layout-aware objectives: block-order prediction, masked-block prediction, and image-fitting prediction.
4. **Downstream Task Evaluation**: The model is evaluated on two layout-aware tasks, text block filling and image suggestion, demonstrating the effectiveness of the hierarchical architecture and the pretraining techniques.
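To make the first two steps concrete, here is a minimal Python sketch of a content-block record and the two-level encode-then-aggregate flow. It is illustrative only: the `ContentBlock` schema, the `reading_order` heuristic, and both toy encoders are assumptions introduced here, standing in for the paper's parser and its lower-level and higher-level transformers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = List[float]


@dataclass
class ContentBlock:
    """One parsed unit of a document (hypothetical schema, not the paper's)."""
    block_type: str                           # "text", "table", or "image"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1), origin top-left
    content: str                              # raw text, or a reference to image data
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. font size


def reading_order(blocks: List[ContentBlock]) -> List[ContentBlock]:
    """Sort top-to-bottom, then left-to-right (a simplifying assumption)."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


def encode_document(
    blocks: List[ContentBlock],
    encode_block: Callable[[ContentBlock], Vector],
    aggregate: Callable[[List[Vector]], Vector],
) -> Vector:
    """Two-level flow: encode each block, then aggregate the block vectors."""
    block_vectors = [encode_block(b) for b in reading_order(blocks)]
    return aggregate(block_vectors)


def toy_encode_block(block: ContentBlock) -> Vector:
    # Stand-in for the lower-level multimodal transformer.
    return [float(len(block.content)), 1.0 if block.block_type == "image" else 0.0]


def toy_aggregate(vectors: List[Vector]) -> Vector:
    # Stand-in for the higher-level transformer: element-wise mean.
    if not vectors:
        return []
    return [sum(column) / len(vectors) for column in zip(*vectors)]


blocks = [
    ContentBlock("image", (0, 50, 100, 150), "fig1.png"),
    ContentBlock("text", (0, 0, 100, 40), "Title", {"font_size": "24"}),
]
print(encode_document(blocks, toy_encode_block, toy_aggregate))  # [6.5, 0.5]
```

Passing the encoders in as functions keeps the hierarchy visible: swapping either stand-in for a real model would not change the block-then-document structure.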
Through these methods, the LAMPreT framework makes more effective use of document layout information and improves performance on document-understanding tasks.
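As a rough illustration of how training examples for two of the layout-aware objectives (block-order prediction and masked-block prediction) might be constructed, consider the Python sketch below. It rests on stated assumptions rather than the paper's exact recipe: block-order prediction is treated as binary classification over original versus swapped order, blocks are represented as plain strings, and the `MASK` placeholder and both function names are hypothetical.

```python
import random
from typing import List, Tuple

MASK = "[MASKED_BLOCK]"  # hypothetical placeholder, not a token from the paper


def block_order_example(blocks: List[str], rng: random.Random) -> Tuple[List[str], int]:
    """Return (blocks, label): 1 = original order kept, 0 = two blocks swapped.

    Assumes block contents are distinct, so a swap always changes the sequence.
    """
    blocks = list(blocks)  # copy so the caller's list is untouched
    if len(blocks) < 2 or rng.random() < 0.5:
        return blocks, 1
    i, j = rng.sample(range(len(blocks)), 2)  # two distinct positions
    blocks[i], blocks[j] = blocks[j], blocks[i]
    return blocks, 0


def masked_block_example(blocks: List[str], rng: random.Random) -> Tuple[List[str], int, str]:
    """Hide one block; return (masked_blocks, position, target). Needs >= 1 block."""
    position = rng.randrange(len(blocks))
    masked = list(blocks)
    target = masked[position]
    masked[position] = MASK
    return masked, position, target


rng = random.Random(0)  # fixed seed so runs are repeatable
document = ["title", "intro paragraph", "figure", "results table"]
sequence, label = block_order_example(document, rng)
masked, position, target = masked_block_example(document, rng)
print(sequence, label)
print(masked, position, target)
```

In both cases the label comes from the document itself, which is what makes the objectives self-supervised: the higher-level model has to rely on the surrounding blocks and their arrangement to decide whether the order is intact or which content belongs in the hidden slot.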