Abstract: Document layout comprises both structural and visual (e.g., font sizes) information that is vital but often ignored by machine learning models. The few existing models that do use layout information consider only textual content and overlook content in other modalities, such as images. Additionally, the spatial interactions of the content presented in a layout have never been fully exploited. To bridge this gap, we parse a document into content blocks (e.g., text, table, image) and propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document. LAMPreT encodes each block with a multimodal transformer at the lower level, and aggregates the block-level representations and connections with a specifically designed transformer at the higher level. We design hierarchical pretraining objectives in which the lower-level model is trained similarly to multimodal grounding models and the higher-level model is trained with our proposed layout-aware objectives. We evaluate the proposed model on two layout-aware tasks, text block filling and image suggestion, and show the effectiveness of the proposed hierarchical architecture and pretraining techniques.

What problem does this paper attempt to address?

The paper addresses a gap in document understanding: existing machine-learning models often ignore the structural and visual information carried by a document's layout, and the few that do use layout consider only text while ignoring content in other modalities, such as images. Existing models also make little use of the spatial interactions among the content in a document. To address these issues, the paper proposes a layout-aware multimodal hierarchical framework, LAMPreT (Layout-Aware Multimodal PreTraining), which models both the structure and the content of a document, including multimedia content such as images, in order to learn more comprehensive multimodal document representations.

Specifically, the LAMPreT framework works as follows:

1. **Content Block Parsing**: The document is parsed into multiple content blocks. Each block can be text, a table, or an image, and the position, type, and attributes of each block are extracted.
2. **Hierarchical Structure Modeling**: A two-level cascaded Transformer is used. The lower-level model encodes each content block, and the higher-level model, a specially designed Transformer, aggregates the block-level representations and their connections.
3. **Hierarchical Pre-training Objectives**: The lower-level model is trained with the standard masked language modeling (MLM) loss and an image-text matching loss. The higher-level model is trained with three layout-aware objectives: block-order prediction, masked-block prediction, and image-fitting prediction.
4. **Downstream Task Evaluation**: The model is evaluated on two layout-aware tasks, text block filling and image suggestion, to demonstrate the effectiveness of the hierarchical architecture and the pre-training techniques.

With these components, LAMPreT makes more effective use of document layout information and improves performance on document-understanding tasks.
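Steps 1 and 3 of the summary can be made concrete with a small sketch. The Python below is illustrative only: the `ContentBlock` schema, the `make_block_order_example` helper, and the swap-two-blocks formulation of block-order prediction are assumptions made for this example, not details confirmed by the summary or taken from the authors' code.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ContentBlock:
    """One parsed content block (hypothetical schema, not the paper's)."""
    block_type: str   # "text", "table", or "image"
    position: int     # original reading-order index in the document
    content: str      # raw text, or a reference to the image file
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. font size


def make_block_order_example(
    blocks: List[ContentBlock],
    rng: random.Random,
    swap_prob: float = 0.5,
) -> Tuple[List[ContentBlock], int]:
    """Build one training example for a block-order objective.

    Assumed formulation: with probability `swap_prob`, two blocks are
    swapped and the label is 0 (order corrupted); otherwise the label
    is 1 (order intact). A real model input would hide `position`.
    """
    blocks = list(blocks)  # copy so the caller's list is not mutated
    if len(blocks) < 2 or rng.random() >= swap_prob:
        return blocks, 1
    i, j = rng.sample(range(len(blocks)), 2)  # two distinct indices
    blocks[i], blocks[j] = blocks[j], blocks[i]
    return blocks, 0


doc = [
    ContentBlock("text", 0, "Introduction paragraph", {"font_size": "12"}),
    ContentBlock("image", 1, "figure1.png"),
    ContentBlock("table", 2, "Results table"),
]
example, label = make_block_order_example(doc, random.Random(0))
print([b.block_type for b in example], label)
```

In this sketch the higher-level model would receive the block sequence and predict the binary label. Masked-block and image-fitting examples could be built in the same spirit, by hiding one block and asking the model to pick it from a set of candidates, though that too is an assumption about the setup.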

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding
