Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, Seunghyun Park
DOI: https://doi.org/10.48550/arXiv.2305.15080
2023-10-26
Abstract: Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural conversations, their performance on text-rich images still requires improvement. In this paper, we introduce Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details that are often overlooked in existing methods. Cream combines vision and auxiliary encoders, fortified by a contrastive feature alignment technique, to achieve a more effective comprehension of language information in visually situated contexts within the images. Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants. Through rigorous evaluations across diverse visually-situated language understanding tasks that demand reasoning capabilities, we demonstrate the compelling performance of Cream, positioning it as a prominent model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the weak performance of existing large visual language models (LVLMs) on visual document understanding (VDU) tasks involving text-rich images. These models struggle to extract fine-grained features from images, so they fall short on tasks that require jointly analyzing text, objects, and layout, such as Document Visual Question Answering (Document VQA). To overcome this limitation, the paper proposes a new neural architecture, the Contrastive Reading Model (Cream). Cream strengthens the language-image understanding of large language models (LLMs) on text-rich images by combining a vision encoder with an auxiliary encoder and aligning their features with a contrastive technique. The model aims to capture details that existing methods often overlook, and thereby to understand the language information in images more effectively. Through evaluations on a range of visually-situated language understanding tasks that demand reasoning, the paper reports strong performance for Cream, particularly in visual document understanding.
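As a rough illustration of what contrastive feature alignment between two encoders can look like, here is a minimal sketch of a generic symmetric InfoNCE-style loss in pure Python. It is not taken from the paper or its codebase: the function names `l2_normalize` and `info_nce`, the temperature value, and the toy feature vectors are all assumptions chosen for illustration, and the loss Cream actually uses may differ.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(visual, auxiliary, temperature=0.07):
    """Generic symmetric InfoNCE loss over paired feature vectors.

    visual[i] and auxiliary[i] are assumed to describe the same region
    (a positive pair); every other pairing in the batch is a negative.
    The temperature of 0.07 is an illustrative default, not a value
    reported by the paper.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in auxiliary]
    n = len(v)

    # Cosine similarity between every visual/auxiliary pair, scaled by temperature.
    logits = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], where the diagonal entry is the positive.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the visual-to-auxiliary and auxiliary-to-visual directions.
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Toy features: matching pairs should score a lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup, `aligned` comes out lower than `misaligned`, which is the behavior a contrastive objective relies on: minimizing the loss pulls paired features from the two encoders together and pushes unpaired ones apart.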