Abstract: In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.

What problem does this paper attempt to address?

The paper addresses two key issues in Visual Document Understanding (VDU):

1. **Improving the Performance of OCR-Free Methods**: Traditional VDU approaches rely on Optical Character Recognition (OCR) to extract text from documents, which adds time and computational cost during both training and inference. Moreover, OCR errors can propagate to the downstream Vision-Language (VL) model and degrade overall performance. The paper therefore proposes an OCR-free method that processes document images directly, without an explicit OCR step.
2. **Enhancing Model Responsiveness to User Queries**: In existing OCR-free methods, the user query is typically fed only to the language model, while visual features are computed independently of the query. As a result, the visual features can carry information irrelevant to the query, especially in dense documents. The paper introduces VisFocus, which lets the visual model focus on the parts of the document relevant to the user's query, thereby improving performance.

### Solution Overview

- **VisFocus Method**: The visual encoder's down-sampling layers are replaced with Vision-Language Merging Attention (ViLMA) layers, which receive the user query directly and highlight the parts of the document relevant to it. In addition, a new pre-training task, Localized Masked Prompt Modeling (LMPM), feeds a masked snippet of the document text to the visual encoder in place of the prompt, guiding the model to focus on the text patches relevant to that snippet.
- **Experimental Results**: Experiments on multiple benchmark datasets show the advantages of VisFocus over existing approaches, with significant improvements on dense document understanding tasks in particular.

In summary, the paper addresses the limitations of current OCR-free methods by introducing the ViLMA layer and the LMPM pre-training task, improving both the model's responsiveness to user queries and its overall document understanding performance.
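
To make the two ideas above more concrete, the following is a minimal, illustrative sketch and not the paper's implementation. It assumes that prompt conditioning can be approximated by single-head cross-attention in which visual patch features act as queries and prompt token embeddings act as keys and values, and that LMPM-style inputs can be approximated by masking a random fraction of a text snippet's tokens. The names `prompt_guided_merge` and `mask_snippet`, the residual connection, the matrix sizes, and the masking ratio are all hypothetical choices made for illustration; the down-sampling that the real layers replace is omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_guided_merge(patches, prompt, w_q, w_k, w_v):
    """Cross-attend patch features (queries) to prompt tokens (keys/values).

    Assumption of this sketch: one head, no down-sampling, residual add.
    """
    q = matmul(patches, w_q)                      # (num_patches, dim)
    k = matmul(prompt, w_k)                       # (num_tokens, dim)
    v = matmul(prompt, w_v)                       # (num_tokens, dim)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                       # (num_patches, num_tokens)
    attn = [softmax([s / scale for s in row]) for row in scores]
    context = matmul(attn, v)                     # prompt information per patch
    merged = [[p + c for p, c in zip(p_row, c_row)]
              for p_row, c_row in zip(patches, context)]
    return merged, attn


def mask_snippet(token_ids, mask_id, ratio, rng):
    """Replace a random fraction of snippet tokens with mask_id (hypothetical)."""
    n_mask = max(1, round(ratio * len(token_ids)))
    positions = sorted(rng.sample(range(len(token_ids)), n_mask))
    masked = list(token_ids)
    for pos in positions:
        masked[pos] = mask_id
    return masked, positions


rng = random.Random(0)
dim = 4


def random_matrix(rows, cols):
    """Random Gaussian matrix standing in for learned weights or features."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


patches = random_matrix(6, dim)                   # 6 visual patch features
prompt = random_matrix(3, dim)                    # 3 prompt token embeddings
w_q, w_k, w_v = (random_matrix(dim, dim) for _ in range(3))

merged, attn = prompt_guided_merge(patches, prompt, w_q, w_k, w_v)
masked, positions = mask_snippet([11, 12, 13, 14, 15, 16], mask_id=0, ratio=0.5, rng=rng)
```

Each row of `attn` shows how one patch distributes its weight over the prompt tokens, which is the sense in which the visual features become query-dependent in this sketch. During a pre-training step of the kind described above, `masked` would stand in for the prompt and `positions` would mark the tokens the model is asked to recover.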

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
