Abstract: This paper introduces a deep learning model for document information analysis, focusing on document classification, entity relation extraction, and document visual question answering. The model uses transformer-based models to encode the textual, visual, and layout information present in a document image. It is pre-trained and then fine-tuned for several document image analysis tasks. Pre-training incorporates three additional tasks: identifying the reading order of the layout segments in a document image, categorizing layout segments following PubLayNet, and generating the text sequence within a given layout segment (text block). The model also uses a collective pre-training scheme in which the losses of all tasks under consideration, both pre-training and fine-tuning tasks across all datasets, are taken into account. Additional encoder and decoder blocks are added to the RoBERTa network to generate results for all tasks. The model achieves an accuracy of 95.87% on the RVL-CDIP dataset for document classification; F1 scores of 0.9306, 0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets, respectively, for entity relation extraction; and an ANLS score of 0.8468 on the DocVQA dataset for visual question answering. These results indicate that the model is effective at understanding and interpreting complex document layouts and content, making it a promising tool for document analysis tasks.
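
For readers unfamiliar with the ANLS (Average Normalized Levenshtein Similarity) score reported for DocVQA, the following is a minimal sketch of how the metric is commonly computed. It is not the authors' evaluation code. The function names `levenshtein` and `anls`, the lowercasing and whitespace stripping, and the threshold of 0.5 are assumptions based on the usual definition of the metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best similarity to any accepted answer.

    predictions: list of predicted answer strings (hypothetical input format).
    references:  list of lists of accepted answers, one list per question.
    threshold:   assumed cutoff; normalized distances at or above it score 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            # Normalize the edit distance by the longer string's length.
            nl = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition an exact match scores 1, a near miss such as a single wrong character earns partial credit, and an answer whose normalized edit distance reaches the threshold scores 0. A reported ANLS of 0.8468 is the mean of these per-question scores.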

End-to-end Document Recognition and Understanding with Dessurt

OCR-free Document Understanding Transformer

DocFormer: End-to-End Transformer for Document Understanding

Efficient End-to-End Visual Document Understanding with Rationale Distillation

DUBLIN -- Document Understanding By Language-Image Network

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

DocFormerv2: Local Features for Document Understanding

Unified Pretraining Framework for Document Understanding

SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding

CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor

Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

Deep Learning based Visually Rich Document Content Understanding: A Survey

VRDU: A Benchmark for Visually-rich Document Understanding

Nougat: Neural Optical Understanding for Academic Documents

Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding

Towards Complex Document Understanding by Discrete Reasoning

Unifying Vision, Text, and Layout for Universal Document Processing

SelfDoc: Self-Supervised Document Representation Learning

DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding