Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han
2024-11-08
Abstract: We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency without being affected by varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text, eventually minimizing the risk of truncated text caused by the limited capacity of LLMs. Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty that existing document understanding frameworks have with multi-scale document images: processing text and visual elements with varying font sizes and complex layouts without relying on an external OCR engine. Specifically, it focuses on three issues:

1. **Processing multi-scale visual input**: Existing document understanding models struggle with document images that contain multiple font sizes and complex layouts, especially at high resolutions and varying aspect ratios.
2. **Increased computational cost**: Processing multi-scale visual input requires more computation. For large language models (LLMs) built on self-attention, the cost grows quadratically with the number of input tokens, so feeding in multi-scale features is expensive.
3. **Text truncation**: Because LLMs have limited input capacity, reading the entire document text directly can lose information, especially for long texts.

To address these issues, the paper proposes an OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs) and introduces two key techniques:

- **Hierarchical Visual Feature Aggregation (HVFA) module**: Builds a feature pyramid and applies cross-attentive pooling to reduce the number of visual tokens fed to the LLM, lowering computational cost while retaining key information.
- **Relative Text-Position Prediction (RTPP) task**: Trains the model to predict the relative positions of input text, which strengthens its text-reading ability and reduces the risk of text truncation caused by the LLM's limited input capacity.

These techniques work together to enable the framework to perform well across various document understanding tasks, especially on multi-scale document images.
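
The token-reduction idea behind cross-attentive pooling can be illustrated with a small, dependency-free sketch. This is not the paper's implementation: it assumes a single attention head with no learned projections, and it assumes the queries are formed by averaging groups of consecutive tokens. The function names `mean_pool`, `cross_attentive_pool`, and `attention_pairs` are hypothetical and introduced only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens, group):
    """Average each run of `group` consecutive tokens into one query token."""
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


def cross_attentive_pool(tokens, group):
    """Reduce len(tokens) tokens to ceil(len(tokens) / group) tokens.

    Assumption: pooled tokens act as queries; the original tokens act as
    both keys and values (single head, no learned weight matrices).
    """
    queries = mean_pool(tokens, group)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for q in queries:
        # Scaled dot-product score of this query against every original token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of the original tokens.
        output.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return output


def attention_pairs(n_tokens):
    """Number of query-key pairs in self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    # 16 toy "visual tokens", each with 4 features.
    fine = [[math.sin(i + d) for d in range(4)] for i in range(16)]
    pooled = cross_attentive_pool(fine, group=4)
    print(len(fine), "->", len(pooled))  # 16 -> 4
    print(attention_pairs(len(fine)), "->", attention_pairs(len(pooled)))  # 256 -> 16
```

In this toy setting, reducing the token count by a factor of 4 cuts the number of self-attention pairs downstream by a factor of 16, which is the quadratic saving the summary refers to. Unlike plain average pooling, each output token can still draw on every original token through the attention weights.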