CompCap: Improving Multimodal Large Language Models with Composite Captions

Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
2024-12-07
Abstract: How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the limitations of Multimodal Large Language Models (MLLMs) in understanding composite images (CIs). Recent MLLM development has focused mainly on natural images (NIs), and current models struggle with CIs: they often fail to extract information accurately or to perform complex reasoning over it.

### Main Problems

1. **Insufficient Comprehension Ability**: When faced with CIs, existing MLLMs can often extract only part of the information accurately and do not reach a comprehensive understanding.
2. **Data Shortage**: High-quality CI-caption pairs for training MLLMs are lacking, which limits the models' ability to understand CIs.
3. **Mismatch of Existing Data Formats**: Most existing CI training data is formatted for question-answering tasks (as in ChartQA and ScienceQA). High-quality image-caption datasets, which are critical for vision-language alignment, are available only for NIs.

### Solutions

To address these problems, the authors propose the CompCap framework. It uses LLMs and automation tools to synthesize composite images with accurate, detailed captions, and is used to build CompCap-118K, a dataset of 118,000 CI-caption pairs covering six types of composite images. The work has three parts:

1. **Data Generation**: Synthesize composite images and their corresponding captions from metadata (such as image-caption pairs, layout information, and text or table data).
2. **Framework Design**: The CompCap framework can flexibly generate various types of composite images while keeping the generated captions both accurate and detailed.
3. **Experimental Verification**: Validate the CompCap-118K dataset through supervised fine-tuning of MLLMs at three scales (xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B).
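
The data generation idea can be illustrated with a minimal sketch for a single CI type, an image collage. This is not the authors' implementation: the `Tile` structure, the `grid_boxes` and `compose_caption` helpers, and the caption template are all hypothetical, and the sketch uses plain string templating where the paper reports using LLMs and automation tools. Image rendering is left out; only the layout and the caption are derived from the metadata.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """One sub-image of a collage (hypothetical metadata record)."""
    image_id: str
    caption: str


def grid_boxes(rows, cols, cell_w, cell_h):
    """Return (left, top, right, bottom) pixel boxes in row-major order.

    A rendering tool could paste each sub-image into its box; that step
    is intentionally not shown here.
    """
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


def compose_caption(tiles, rows, cols):
    """Build a composite caption from per-image captions and the layout.

    Because the caption is assembled from the same metadata that defines
    the layout, each position it mentions matches the composite image.
    """
    if len(tiles) != rows * cols:
        raise ValueError("number of tiles must equal rows * cols")
    parts = [f"A {rows}x{cols} collage of {len(tiles)} images."]
    for i, tile in enumerate(tiles):
        r, c = divmod(i, cols)  # row-major index -> (row, column)
        parts.append(f"Row {r + 1}, column {c + 1}: {tile.caption}")
    return " ".join(parts)


if __name__ == "__main__":
    tiles = [
        Tile("img_a", "A dog on a beach."),
        Tile("img_b", "A red bicycle."),
    ]
    print(grid_boxes(1, 2, 100, 50))
    print(compose_caption(tiles, 1, 2))
```

The design point is that the caption is generated alongside the image from shared metadata rather than written after the fact, which is what allows captions to be both detailed and accurate at scale.
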
### Experimental Results

After supervised fine-tuning with CompCap-118K, the three MLLMs improve by an average of 1.7%, 2.0%, and 2.9%, respectively, across eleven benchmarks.

### Summary

The main contributions of this research are:

- Identifying the limitations of existing MLLMs in understanding composite images.
- Proposing and implementing the CompCap framework to generate high-quality composite images and captions.
- Constructing the CompCap-118K dataset, which significantly improves MLLMs' ability to understand composite images.

Together, these contributions improve MLLM performance on composite images and provide a foundation for further research and applications.