Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, Seunghyun Park

DOI: https://doi.org/10.48550/arXiv.2305.15080

2023-10-26

Abstract: Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural conversations, their performance on text-rich images still requires improvement. In this paper, we introduce Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details that are often overlooked in existing methods. Cream combines vision and auxiliary encoders, fortified by a contrastive feature alignment technique, to achieve a more effective comprehension of language information in visually situated contexts within the images. Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants. Through rigorous evaluations across diverse visually-situated language understanding tasks that demand reasoning capabilities, we demonstrate the compelling performance of Cream, positioning it as a prominent model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the weak performance of existing large visual language models (LVLMs) on visual document understanding (VDU) tasks that involve text-rich images. These models struggle to extract fine-grained features from images, so they fall short on tasks that require jointly analyzing text, objects, and layout; Document Visual Question Answering (Document VQA) is one example. To overcome this limitation, the paper proposes a new neural architecture, the Contrastive Reading Model (Cream). Cream improves the language-image understanding of large language models (LLMs) on text-rich images by combining a vision encoder with an auxiliary encoder and aligning their features with a contrastive technique. The model is designed to capture details that existing methods often overlook, which lets it read the language information in an image more effectively. Evaluations across visually-situated natural language understanding tasks that demand reasoning show that Cream performs strongly, particularly in visual document understanding.
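The summary above does not spell out the form of the contrastive feature alignment, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss that pulls each vision feature toward its paired auxiliary feature and pushes it away from the others in the batch. The function names (`l2_normalize`, `info_nce`), the `temperature` value, and the toy 2-D features are all assumptions made for illustration and are not taken from the Cream codebase.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(vision_feats, aux_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired feature lists (hypothetical helper).

    vision_feats[i] and aux_feats[i] are assumed to describe the same image
    region; every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in vision_feats]
    a = [l2_normalize(x) for x in aux_feats]
    n = len(v)

    # Cosine similarity between every vision/auxiliary pair, scaled by temperature.
    logits = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the vision-to-auxiliary and auxiliary-to-vision directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy check: matched pairs should score a lower loss than swapped pairs.
vision = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(vision, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(vision, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is near zero when the two encoders agree on which features belong together and grows when the pairing is swapped, which is the behavior an alignment objective of this kind is meant to reward.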
