TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai

2024-03-15

Abstract: We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions. By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens, and that by using similarity to retain only the significant tokens, we can not only shorten the token sequence but also enhance the model's performance. Moreover, by expanding the model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability. The model also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartVQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). It also improves scene text spotting by 10.9% and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-sourced large multimodal models for document understanding. Code will be released at
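One common reading of zero-initialization, offered here as an illustrative sketch and not as the paper's implementation, is that a newly added branch has its output projection initialized to zero, so the block starts as an identity mapping and cannot disturb the pretrained features early in training. The function names `make_zero_init_weight`, `matvec`, and `residual_block` below are hypothetical, and the branch function stands in for whatever attention computation the model actually uses.

```python
def make_zero_init_weight(dim):
    """Hypothetical output projection: a dim x dim matrix of zeros."""
    return [[0.0] * dim for _ in range(dim)]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def residual_block(x, branch_fn, out_weight):
    """Compute y = x + W_out @ branch(x).

    With W_out initialized to zero the update term vanishes, so the
    block returns its input unchanged until training moves W_out.
    """
    update = matvec(out_weight, branch_fn(x))
    return [a + b for a, b in zip(x, update)]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    # Stand-in for the new attention branch (assumption, any function works).
    branch = lambda v: [10.0 * t for t in v]
    print(residual_block(x, branch, make_zero_init_weight(3)))
```

Under this assumption the new branch contributes nothing at the first step, which matches the abstract's claim that zero-initialization stabilizes early training.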

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the extraction of key information from varied sources, including tables, forms, invoices, and text in the wild, with the goal of automating and streamlining workflows built on documents and scene text. It proposes TextMonkey, a large multimodal model (LMM) optimized for text-centric tasks, which targets the limitations of existing methods on high-resolution images, particularly their weakness in recognizing small text. The main improvements are:

1. **Enhancing Cross-Window Relationships**: Shifted Window Attention with zero initialization connects different windows at higher input resolutions and stabilizes early training.
2. **Compressing Redundant Tokens**: A Token Resampler uses similarity to select the important tokens, which shortens the token sequence and improves model performance.
3. **Supporting Text Localization Tasks**: The model's capabilities are extended beyond text question answering to include text spotting and grounding, and positional information is incorporated into the answers to improve interpretability and reliability.

Experimental results show that TextMonkey improves on multiple benchmarks, particularly in scene text question answering, document-oriented tasks, and key information extraction. It scores 561 on OCRBench, surpassing previous open-source large multimodal models for document understanding.
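The idea of using similarity to pick out important tokens can be sketched as follows. The text above only states that similarity is used; the specific criterion here (keep the tokens whose maximum cosine similarity to any other token is lowest, that is, the least redundant ones) is an assumption made for illustration, and the names `cosine` and `select_distinct_tokens` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_distinct_tokens(tokens, keep):
    """Return sorted indices of the `keep` least redundant tokens.

    A token's redundancy score is its highest cosine similarity to any
    other token; tokens with the lowest scores are kept (assumed rule).
    """
    scores = []
    for i, t in enumerate(tokens):
        max_sim = max(
            (cosine(t, u) for j, u in enumerate(tokens) if j != i),
            default=0.0,
        )
        scores.append((max_sim, i))
    return sorted(i for _, i in sorted(scores)[:keep])


if __name__ == "__main__":
    # Tokens 0 and 1 are near-duplicates; tokens 2 and 3 are distinct.
    tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
    print(select_distinct_tokens(tokens, keep=2))
```

In this toy input the two near-duplicate tokens score as highly redundant and are dropped, while the two tokens that resemble nothing else are retained.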

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

On the Hidden Mystery of OCR in Large Multimodal Models

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

Structured Multimodal Attentions for TextVQA

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

MMR: Evaluating Reading Ability of Large Multimodal Models

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding