FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang

2024-09-20

Abstract: Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the heavy dependence of Multimodal Large Language Models (MLLMs) on high-quality data for vision-language tasks. Existing methods curate such data with GPT-4V, but they scale poorly because GPT-4V is a commercial model and the prompts used to instruct it are simple. To overcome these issues, the authors designed a data engine named FullAnno, which automatically generates large-scale, high-quality, and fine-grained image annotation data.

The main contributions of the paper are:

1. **Designing the FullAnno data engine**: A cascade annotation process combines multiple expert models with rich prompts to produce detailed image annotations, including object categories and positions, region descriptions, text information, and dense image captions.
2. **Re-annotating the COCO and Visual Genome datasets**: The re-annotation triples the number of object annotations and increases the length of the original image captions by a factor of 15.
3. **Validating the effectiveness of the re-annotated data**: Experiments show that the re-annotated data significantly improves the performance of LLaVA-v1.5 on several benchmarks.

Overall, the paper aims to improve the performance of MLLMs on vision-language tasks by generating high-quality image annotation data.
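To make the idea of a "rich prompt" concrete, the following is a minimal sketch of how outputs from expert models (object categories and positions, region descriptions, text in the image) might be folded into a single prompt for a captioning LLM. It is not the authors' implementation: the `ObjectAnnotation` and `ImageAnnotation` schema, the `build_caption_prompt` function, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    """One object (hypothetical schema): category, box (x1, y1, x2, y2), region text."""
    category: str
    box: Tuple[int, int, int, int]
    description: str = ""  # optional region description


@dataclass
class ImageAnnotation:
    """Fine-grained annotations for one image (hypothetical schema)."""
    objects: List[ObjectAnnotation] = field(default_factory=list)
    ocr_text: List[str] = field(default_factory=list)


def build_caption_prompt(ann: ImageAnnotation) -> str:
    """Fold the annotations into one text prompt (assumed wording, not from the paper)."""
    lines = [
        "Write a detailed caption for the image using the annotations below.",
        "Objects:",
    ]
    for i, obj in enumerate(ann.objects, start=1):
        entry = f"{i}. {obj.category} at {list(obj.box)}"
        if obj.description:
            entry += f": {obj.description}"
        lines.append(entry)
    if ann.ocr_text:
        lines.append("Text in image: " + "; ".join(ann.ocr_text))
    return "\n".join(lines)


if __name__ == "__main__":
    example = ImageAnnotation(
        objects=[ObjectAnnotation("dog", (10, 20, 110, 220), "a brown dog lying on grass")],
        ocr_text=["STOP"],
    )
    print(build_caption_prompt(example))
```

In a real pipeline the resulting string would be sent to an LLM together with or instead of the image; which models fill each field, and in what order, is left open here because the summary above does not specify it.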

AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators

MEGAnno+: A Human-LLM Collaborative Annotation System

LLMGA: Multimodal Large Language Model based Generation Assistant

InfMLLM: A Unified Framework for Visual-Language Tasks

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

CompCap: Improving Multimodal Large Language Models with Composite Captions

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

Improving Visual Storytelling with Multimodal Large Language Models

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Improving Multi-modal Large Language Model through Boosting Vision Capabilities

Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

What If We Recaption Billions of Web Images with LLaMA-3?

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks