Abstract: Recent advances in training large language models (LLMs) on massive amounts of purely textual data have led to strong generalization across many domains and tasks, including document-specific tasks. In contrast, there is a trend toward training multi-modal transformer architectures tailored for document understanding, designed specifically to fuse textual inputs with the corresponding document layout. This requires a separate fine-tuning step for which additional training data is needed. At present, no document transformers with generalization comparable to LLMs are available. This raises the question of which type of model is preferable for document understanding tasks. In this paper, we investigate the possibility of using purely text-based LLMs for document-specific tasks by means of layout enrichment. We explore drop-in modifications and rule-based methods to enrich purely textual LLM prompts with layout information. In our experiments, we investigate the effects on the commercial ChatGPT model and the open-source LLM Solar. We demonstrate that, with our approach, both LLMs show improved performance on various standard document benchmarks. In addition, we study the impact of noisy OCR and layout errors, as well as the limitations of LLMs when it comes to utilizing document layout. Our results indicate that layout enrichment can improve the performance of purely text-based LLMs for document understanding by up to 15% compared to using plain document text alone. In conclusion, this approach should be considered when choosing between text-based LLMs and multi-modal document transformers.
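As a rough illustration of what rule-based layout enrichment of a text prompt could look like, the following minimal sketch places OCR words on a character grid so that horizontal and vertical positions survive as spaces and line breaks. It is not the method described in the paper: the `Word` tuple format, the `render_layout` and `build_prompt` helpers, and the `char_width` and `line_height` defaults are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR word format (assumption): (text, x_left, y_top) in pixels.
Word = Tuple[str, float, float]


def render_layout(words: List[Word],
                  char_width: float = 10.0,
                  line_height: float = 20.0) -> str:
    """Approximate the page layout with spaces and newlines.

    Words are bucketed into rows by their y coordinate and placed at a
    column derived from their x coordinate. Empty rows are not emitted.
    """
    rows: Dict[int, List[Tuple[float, str]]] = {}
    for text, x, y in words:
        row = int(y // line_height)
        rows.setdefault(row, []).append((x, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for x, text in sorted(rows[row]):
            col = int(x // char_width)
            if line:
                # Keep at least one space between neighbouring words.
                pad = max(col - len(line), 1)
            else:
                pad = col
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


def build_prompt(words: List[Word], question: str) -> str:
    """Wrap the layout-preserving text in a plain question-answering prompt."""
    return f"Document:\n{render_layout(words)}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    sample = [("Invoice", 0, 0), ("Total:", 0, 40), ("42.00", 200, 40)]
    print(build_prompt(sample, "What is the total?"))
```

The resulting string can be sent to any text-only LLM in place of the flat OCR text. The grid resolution is a trade-off: smaller `char_width` values preserve alignment more faithfully but spend more prompt tokens on whitespace.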

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

DocLLM: A layout-aware generative language model for multimodal document understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Enhancing Visually-Rich Document Understanding Via Layout Structure Modeling

LAPDoc: Layout-Aware Prompting for Documents

Large Language Models Understand Layout

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

LawLLM: Law Large Language Model for the US Legal System

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

VideoLLM: Modeling Video Sequence with Large Language Models