VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji
2024-06-15
Abstract: The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, achieve only modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an $85.8\%$ accuracy rate in image association and a $0.508$ Rouge score. These results validate the effectiveness of our dataset in improving MLLMs' capabilities for nuanced image-text comprehension.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty that Multi-modal Large Models (MLLMs) have with complex vision-language comprehension. Current models and benchmarks mostly target narrow visual and textual contexts, and models often perform poorly when the input contains a large amount of irrelevant and potentially misleading text and images. To address this, the paper introduces a new task called Interleaved Image-Text Comprehension (IITC), which requires a model to identify and ignore superfluous elements in both images and text so that it can answer questions accurately and follow complex instructions to pinpoint the relevant image. To support this task, the paper constructs a new dataset named VEGA, designed for IITC on scientific content. It also defines a subtask, Image-Text Association (ITA), to strengthen the model's ability to associate images with text. Evaluating four leading closed-source models and various open-source models on VEGA, the authors show how demanding IITC is: even state-of-the-art models such as Gemini-1.5-pro and GPT4V achieve only modest success. Using a multi-task, multi-scale post-training strategy, the paper establishes a baseline for MLLMs on the IITC task, reaching an image association accuracy of 85.8% and a Rouge score of 0.508. These results support the effectiveness of the VEGA dataset in improving models' ability to handle nuanced image-text comprehension.
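The two reported numbers, image association accuracy and a Rouge score, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The text above does not say which Rouge variant is used, so the sketch assumes ROUGE-L (longest common subsequence) F1 with lowercase whitespace tokenization, and it assumes image association is scored as an exact match between predicted and gold image indices. All function names are hypothetical.

```python
from typing import List, Sequence


def image_association_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions for which the predicted image index equals the gold index.

    Assumption: one predicted image index and one gold image index per question.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated answer and a reference answer.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `image_association_accuracy([1, 2, 3, 0], [1, 2, 0, 0])` scores three of four questions as correct, and `rouge_l_f1` returns 1.0 when the generated answer matches the reference token for token. A corpus-level score under these assumptions would be the mean of `rouge_l_f1` over all question-answer pairs.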