TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
2024-03-15
Abstract: We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions. By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens, and by using similarity to select the significant tokens, we can not only shorten the token sequence but also enhance the model's performance. Moreover, by expanding the model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability. The model also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It improves scene text spotting by 10.9% and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding. Code will be released at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of extracting key information from varied sources, including tables, forms, invoices, and scene text in the wild, with the goal of automating and optimizing workflows built on document and scene text. Specifically, it proposes TextMonkey, a large multimodal model (LMM) optimized for text-centric tasks. The model targets the limitations of existing methods on high-resolution images, particularly their difficulty recognizing small text. The main improvements are:

1. **Enhancing Cross-Window Relationships**: Shifted Window Attention connects the separate windows of a high-resolution input, and zero initialization of the added attention stabilizes early training.
2. **Compressing Redundant Tokens**: A Token Resampler uses similarity to select the important tokens, which reduces token length and improves model performance.
3. **Supporting Text Localization Tasks**: The model's capabilities are extended beyond text question answering to include text spotting and grounding, and positional information is incorporated into the answers to improve interpretability and reliability.

Experimental results show that TextMonkey improves on multiple benchmarks, particularly in scene text question answering, document-oriented tasks, and key information extraction. It scores 561 on OCRBench, surpassing previous open-source large multimodal models for document understanding.
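The idea of using similarity to decide which image tokens to keep can be illustrated with a small sketch. This is not the paper's Token Resampler; it is a simplified stand-in under stated assumptions: tokens are plain Python lists of floats, similarity is cosine similarity, and a token counts as redundant when it closely matches some other token. The function name `select_distinct_tokens` and the `keep` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_distinct_tokens(tokens, keep):
    """Return the sorted indices of the `keep` least redundant tokens.

    Assumption (not from the paper): a token's redundancy is its highest
    cosine similarity to any other token, so near-duplicates score high.
    """
    n = len(tokens)
    if not 0 <= keep <= n:
        raise ValueError("keep must be between 0 and the number of tokens")

    redundancy = []
    for i in range(n):
        # Highest similarity between token i and every other token.
        best = max(
            (cosine(tokens[i], tokens[j]) for j in range(n) if j != i),
            default=0.0,
        )
        redundancy.append(best)

    # Least redundant first; ties are broken by original position.
    order = sorted(range(n), key=lambda i: (redundancy[i], i))
    # Restore spatial order for the tokens that survive.
    return sorted(order[:keep])


# Toy input: tokens 0 and 1 are near-duplicates, tokens 2 and 3 are distinct.
tokens = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept = select_distinct_tokens(tokens, keep=2)
print(kept)
```

In this toy input the two near-duplicate tokens are dropped first, so the shorter sequence still covers the distinct content. A real model would operate on batched tensors of visual features and would typically pass the selected tokens to a later module rather than simply discarding the rest.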