FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
2024-09-20
Abstract: Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the heavy dependence of Multimodal Large Language Models (MLLMs) on high-quality data during Supervised Fine-Tuning (SFT). Existing methods curate such data with GPT-4V, but they scale poorly because GPT-4V is a commercial model and because the prompts used to instruct it are simple. To overcome these issues, the authors designed a data engine named FullAnno, which automatically generates large-scale, high-quality, and fine-grained image annotations. The main contributions of the paper are:
1. **Designing the FullAnno data engine**: A cascade annotation process combines multiple expert models with rich prompts to produce detailed image annotations, including object categories and positions, region descriptions, text information, and dense image captions.
2. **Re-annotating the COCO and Visual Genome datasets**: The re-annotation triples the number of object annotations and increases the length of the original image captions by a factor of 15.
3. **Validating the effectiveness of the re-annotated data**: Experiments show that the regenerated annotations significantly improve the performance of LLaVA-v1.5 on several benchmarks.
Overall, the paper aims to improve MLLM performance on vision-language tasks by generating high-quality image annotation data.
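The abstract does not give implementation details, so the following is only a minimal sketch of how the final cascade step might look, assuming that the expert-model outputs (object categories, boxes, region descriptions, and detected text) are collected first and then serialized into a rich prompt for an LLM. The class names, the `build_caption_prompt` function, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectAnnotation:
    """One detected object (hypothetical structure, not from the paper)."""
    category: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
    description: str = ""  # optional region description


@dataclass
class ImageAnnotation:
    """Outputs gathered from the expert models for a single image."""
    objects: list[ObjectAnnotation] = field(default_factory=list)
    ocr_text: list[str] = field(default_factory=list)


def build_caption_prompt(annotation: ImageAnnotation) -> str:
    """Serialize the annotations into a text prompt for a caption-writing LLM."""
    lines = [
        "Describe the image in detail using the annotations below.",
        "Objects:",
    ]
    for index, obj in enumerate(annotation.objects, start=1):
        x1, y1, x2, y2 = obj.box
        entry = f"{index}. {obj.category} at [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        if obj.description:
            entry += f": {obj.description}"
        lines.append(entry)
    # Only mention text when the (assumed) OCR stage found something.
    if annotation.ocr_text:
        lines.append("Text in image: " + "; ".join(annotation.ocr_text))
    return "\n".join(lines)
```

In this sketch, the returned string would be sent to whichever LLM produces the dense caption; that call is omitted because the abstract does not say which model or interface is used.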