Abstract: How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.

What problem does this paper attempt to address?

This paper addresses the limitations of Multimodal Large Language Models (MLLMs) in understanding Composite Images (CIs). Current MLLMs are trained mainly on Natural Images (NIs) and struggle with CIs, often failing to extract information accurately or to perform complex reasoning over them.

### Main Problems

1. **Insufficient Comprehension Ability**: When faced with CIs, existing MLLMs often extract only part of the information accurately and fail to reach a comprehensive understanding.
2. **Data Shortage**: High-quality CI-caption pairs for training MLLMs are lacking, which limits the models' ability to understand CIs.
3. **Mismatch of Existing Data Formats**: Most existing CI training data is formatted for question-answering tasks (such as ChartQA and ScienceQA), while high-quality image-caption datasets, which are crucial for vision-language alignment, are available only for NIs.

### Solutions

To address these problems, the authors propose the CompCap framework. It uses LLMs and automation tools to synthesize composite images with accurate, detailed captions, and is used to build CompCap-118K, a dataset of 118K CI-caption pairs covering six types of composite images. The specific steps are as follows:

1. **Data Generation**: Synthesize composite images and their corresponding captions from metadata (such as image-caption pairs, layout information, and text or table data).
2. **Framework Design**: The CompCap framework can flexibly generate various types of composite images while keeping the generated captions both accurate and detailed.
3. **Experimental Verification**: Validate CompCap-118K by supervised fine-tuning of MLLMs at three scales (xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B).
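The data generation step can be illustrated with a minimal sketch for one CI type, a grid collage. This is not the authors' implementation: the `Tile` class, the `layout_grid` and `compose_caption` helpers, the image identifiers, and the caption template are all hypothetical, and the actual image rendering is omitted. The sketch only shows how image-caption pairs plus layout information could be turned into a single caption for the composite.

```python
import math
from dataclasses import dataclass


@dataclass
class Tile:
    """One source image placed in the collage (hypothetical structure)."""
    image_id: str  # placeholder identifier; no image is actually loaded
    caption: str   # caption of the source natural image
    row: int       # zero-based grid row
    col: int       # zero-based grid column


def layout_grid(pairs, n_cols):
    """Assign (image_id, caption) pairs to grid cells in reading order."""
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    return [
        Tile(image_id, caption, index // n_cols, index % n_cols)
        for index, (image_id, caption) in enumerate(pairs)
    ]


def compose_caption(tiles, n_cols):
    """Build one caption for the composite from the layout and tile captions."""
    if not tiles:
        raise ValueError("at least one tile is required")
    n_rows = math.ceil(len(tiles) / n_cols)
    header = (
        f"A collage of {len(tiles)} images arranged in a "
        f"{n_rows}x{n_cols} grid."
    )
    # Describe each tile by its one-based position followed by its caption.
    parts = [
        f"Row {tile.row + 1}, column {tile.col + 1}: {tile.caption}"
        for tile in tiles
    ]
    return " ".join([header] + parts)


if __name__ == "__main__":
    # Made-up example pairs, used only to exercise the sketch.
    pairs = [
        ("img_a", "A dog runs on a beach."),
        ("img_b", "A red bicycle leans against a wall."),
        ("img_c", "A bowl of ramen."),
    ]
    tiles = layout_grid(pairs, n_cols=2)
    print(compose_caption(tiles, n_cols=2))
```

Because the caption is derived from the same metadata that determines the layout, it stays consistent with the composite by construction, which is the property the summary attributes to CompCap's captions.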
### Experimental Results

After supervised fine-tuning with CompCap-118K, the three models (xGen-MM-inst.-4B, LLaVA-NeXT-Vicuna-7B, and LLaVA-NeXT-Vicuna-13B) show average gains of 1.7%, 2.0%, and 2.9%, respectively, across eleven benchmarks.

### Summary

The main contributions of this research are:

- Identifying the limitations of existing MLLMs in understanding composite images.
- Proposing and implementing the CompCap framework to generate composite images with accurate, detailed captions.
- Constructing the CompCap-118K dataset, which improves MLLMs' ability to understand composite images.

Together, these contributions improve MLLM performance on composite images and provide a foundation for further research and applications.

CompCap: Improving Multimodal Large Language Models with Composite Captions
