Abstract: Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle with comprehending intensive textual contents embedded within the images, primarily due to the limited text recognition and layout understanding ability. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the difficulty multimodal language models have in understanding dense textual content in images. Existing large multimodal language models handle general image understanding well, but they remain limited when an image contains a large amount of text. Specifically, these models perform poorly in two respects:

1. **Limited Text Recognition Ability**: Traditional visual encoders are not effective at recognizing textual content in images, especially when there are many text blocks.
2. **Insufficient Layout Understanding Ability**: Existing multimodal models struggle to understand how text is laid out within an image, which hurts overall comprehension of the textual content.

To overcome these issues, the paper proposes the **LLaVA-Read** model, which improves the understanding of textual content in images by combining dual visual encoders with a visual text encoder. The specific improvements include:

- **Dual Visual Encoders**: A low-resolution encoder captures global visual information, while a high-resolution encoder captures detailed visual information.
- **Visual Text Encoder**: A lightweight visual text encoder (for example, an OCR tool) extracts textual content and its positional information from high-resolution images.
- **Fusion Module**: The fusion module merges information from the high-resolution encoder into the low-resolution encoder, which reduces the number of visual tokens and improves the model's efficiency.
- **Layout-Aware Pretraining and Fine-Tuning**: Layout-aware pretraining and instruction fine-tuning tasks strengthen the collaboration between the multiple visual encoders, improving the understanding of text-rich images.

With these improvements, LLaVA-Read performs strongly on a range of text-rich image understanding tasks, surpassing existing state-of-the-art models.
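To make the data flow between these components concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the fusion rule (average-pool the high-resolution grid, then add it to the low-resolution grid) and the `<x,y>word` serialization of OCR output are assumptions chosen for illustration, and the names `OcrWord`, `fuse_into_low_res`, and `layout_text` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One word from a hypothetical OCR tool, with normalized coordinates."""
    text: str
    x: float  # left edge, in [0, 1]
    y: float  # top edge, in [0, 1]


def pool_high_res(high, factor):
    """Average-pool a 2D grid of feature vectors by `factor` per axis."""
    rows, cols = len(high) // factor, len(high[0]) // factor
    dim = len(high[0][0])
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            acc = [0.0] * dim
            for dr in range(factor):
                for dc in range(factor):
                    cell = high[r * factor + dr][c * factor + dc]
                    for k in range(dim):
                        acc[k] += cell[k]
            pooled_row.append([v / (factor * factor) for v in acc])
        pooled.append(pooled_row)
    return pooled


def fuse_into_low_res(low, high, factor):
    """Assumed fusion rule: add pooled high-res features to the low-res grid.

    The output has the same number of tokens as `low`, so the language
    model never sees the larger high-resolution token grid.
    """
    pooled = pool_high_res(high, factor)
    fused = []
    for low_row, pooled_row in zip(low, pooled):
        fused.append([[a + b for a, b in zip(lv, pv)]
                      for lv, pv in zip(low_row, pooled_row)])
    return fused


def layout_text(words, bins=100):
    """Serialize OCR words top-to-bottom, left-to-right, with quantized positions."""
    ordered = sorted(words, key=lambda w: (round(w.y * bins), round(w.x * bins)))
    return " ".join(f"<{round(w.x * bins)},{round(w.y * bins)}>{w.text}"
                    for w in ordered)
```

In this sketch, a 4x4 high-resolution grid fused into a 2x2 low-resolution grid with `factor=2` still yields only four visual tokens, while the OCR words reach the language model as ordinary text that carries coarse position tags. A real system would replace the pooling-and-add step with a learned module.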

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

EVLM: An Efficient Vision-Language Model for Visual Understanding

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

VLLaVO: Mitigating Visual Gap through LLMs

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Enhancing Advanced Visual Reasoning Ability of Large Language Models

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Visually-Augmented Language Modeling

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding