VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji

2024-06-15

Abstract: The swift progress of Multi-modal Large Language Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions, and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new dataset, VEGA, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models, using VEGA underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, achieve only modest success. By employing a multi-task, multi-scale post-training strategy, we set a robust baseline for MLLMs on the IITC task, attaining an $85.8\%$ accuracy rate in image association and a $0.508$ Rouge score. These results validate the effectiveness of our dataset in improving MLLMs' capabilities for nuanced image-text comprehension.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses the difficulty that Multi-modal Large Language Models (MLLMs) have with complex vision-and-language understanding tasks. Current models and benchmarks mainly cover narrow visual and textual contexts, and models often perform poorly when the input contains a large amount of irrelevant and potentially misleading information. To address this, the paper introduces a new task called Interleaved Image-Text Comprehension (IITC), which requires models to identify and ignore redundant elements in both images and text in order to answer questions accurately and to follow complex instructions to find the relevant image. To support this task, the paper constructs a new dataset named VEGA, designed for IITC on scientific content. In addition, a subtask called Image-Text Association (ITA) is designed to improve the model's ability to associate images with text. Evaluations of four leading closed-source models and various open-source models on VEGA highlight the difficulty of the IITC task: even state-of-the-art models such as Gemini-1.5-pro and GPT4V achieve only modest success. By adopting a multi-task, multi-scale post-training strategy, the paper establishes a robust baseline for MLLMs on the IITC task, reaching an image association accuracy of 85.8% and a ROUGE score of 0.508. These results validate the effectiveness of the VEGA dataset in improving models' ability to understand nuanced image-text relationships.
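The two reported numbers, image association accuracy and a ROUGE score, can be illustrated with a minimal sketch. The source does not say which ROUGE variant is used or how answers are tokenized, so the sketch assumes ROUGE-L F1 over lowercased, whitespace-split tokens. The function names are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence


def image_association_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions whose predicted image index equals the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 (assumed variant): harmonic mean of LCS precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a dataset-level score would be the mean of `rouge_l_f1` over all question-answer pairs, with accuracy computed once over all predicted image indices.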


MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

InfMLLM: A Unified Framework for Visual-Language Tasks

VIGC: Visual Instruction Generation and Correction

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

Can MLLMs Perform Text-to-Image In-Context Learning?

COCO is "ALL" You Need for Visual Instruction Fine-tuning

EVLM: An Efficient Vision-Language Model for Visual Understanding

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy