Abstract: We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency without being affected by varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text, eventually minimizing the risk of truncated text caused by the limited capacity of LLMs. Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.

What problem does this paper attempt to address?

The paper addresses the challenges that existing document understanding frameworks face with multi-scale document images: how to process text and visual elements with different font sizes and complex layouts without relying on external OCR engines. Specifically, it focuses on the following points:

1. **Processing of multi-scale visual input**: Existing document understanding models have difficulty processing document images that contain multiple font sizes and complex layouts, especially high-resolution images with varying aspect ratios.
2. **Increase in computational cost**: Processing multi-scale visual input requires more computation. The self-attention in large language models (LLMs) scales quadratically with the number of input tokens, so adding multi-scale features significantly increases the cost.
3. **Text truncation problem**: Because LLMs have limited input capacity, directly reading the entire document text may lose information, especially for long texts.

To solve these problems, the paper proposes an OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs) and introduces the following key techniques:

- **Hierarchical Visual Feature Aggregation (HVFA) module**: Builds a feature pyramid and applies cross-attentive pooling to reduce the number of visual tokens passed to the LLM, lowering computational cost while retaining key information.
- **Relative Text-Position Prediction (RTPP) task**: Trains the model to predict the relative positions of input text, which strengthens its text-reading ability and reduces the risk of text truncation caused by the limited input capacity of LLMs.

These techniques work together to enable the framework to perform well in various document understanding tasks, especially when dealing with multi-scale document images.
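The token-reduction idea behind cross-attentive pooling can be illustrated with a toy sketch. This is not the paper's HVFA implementation. It assumes a simplified setup with no learned projections and 1-D windows instead of 2-D image patches, where each query is the mean of a window and attends only over that window. The names `cross_attentive_pool`, `self_attention_pairs`, and `window` are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attentive_pool(tokens, window):
    """Reduce a token sequence by a factor of `window`.

    Assumption (not from the paper): the query for each window is the mean
    of its tokens, and keys/values are the raw tokens of that window.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        # Query: mean of the window's tokens.
        query = [sum(t[d] for t in group) / len(group) for d in range(dim)]
        # Scaled dot-product scores between the query and each token.
        scores = [scale * sum(q * k for q, k in zip(query, t)) for t in group]
        weights = softmax(scores)
        # Output: attention-weighted sum of the window's tokens.
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, group)) for d in range(dim)]
        )
    return pooled


def self_attention_pairs(num_tokens):
    """Number of pairwise attention scores for a sequence (quadratic)."""
    return num_tokens * num_tokens


# Toy input: 16 fine-scale tokens of dimension 4 (values are arbitrary).
fine = [[math.sin(i + d) for d in range(4)] for i in range(16)]
coarse = cross_attentive_pool(fine, window=4)

print(len(fine), "->", len(coarse), "tokens")
print(
    self_attention_pairs(len(fine)),
    "->",
    self_attention_pairs(len(coarse)),
    "pairwise attention scores",
)
```

In this toy setting, pooling by a factor of 4 cuts the downstream pairwise attention scores by a factor of 16, which is the kind of trade-off between information loss and efficiency that the summary attributes to HVFA.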

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

HRVDA: High-Resolution Visual Document Assistant

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Hierarchical visual-semantic interaction for scene text recognition

3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Focus Anywhere for Fine-grained Multi-page Document Understanding

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid