Abstract: Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous studies on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture of specialists in caption and OCR, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. The pipeline leverages the VLM itself to extract cross-modal information via different prompts and filter the generated outputs again via a consistency filtering strategy. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also demonstrates that the new code parsing ability of VLMs presents better cross-modal equivalence than the commonly used detail caption ability. Our code is available at https://github.com/foundation-multimodal-models/World2Code.
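The "Python code format" mentioned in the abstract can be pictured with a minimal sketch. The schema below (the `ImageDescription` class, the `Entity` fields, and the `to_python_code` helper) is hypothetical and chosen only for illustration; the layout W2C actually emits is defined in the authors' repository and may differ.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """One described entity. All field names are illustrative assumptions."""
    name: str
    description: str
    ocr_text: str = ""


def to_python_code(caption: str, entities: list[Entity]) -> str:
    """Render a caption and its entities as the source of a small Python class.

    Values are written with repr() so the result is always valid Python,
    even when the text contains quotes or newlines.
    """
    lines = [
        "class ImageDescription:",
        "    def __init__(self):",
        f"        self.caption = {caption!r}",
        "        self.entities = {",
    ]
    for e in entities:
        lines.append(f"            {e.name!r}: {{")
        lines.append(f"                'description': {e.description!r},")
        lines.append(f"                'ocr_text': {e.ocr_text!r},")
        lines.append("            },")
    lines.append("        }")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_python_code(
        "A dog sitting on a sofa",
        [Entity("dog", "a brown dog looking at the camera")],
    ))
```

Because the output is ordinary Python source, a downstream consumer can parse or execute it to recover the structured fields, which is one way to read the abstract's point about structured output.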

What problem does this paper attempt to address?

The paper asks how to generate high-quality multimodal data in a self-instructed way, so as to reduce dependence on mixtures of specialist models and expensive manual annotation, and to improve the performance of vision-language models (VLMs) on visual question answering (VQA) and visual grounding tasks.

### Specific problem description

1. **Challenges in multimodal data generation**:
   - Annotating high-quality image-text pairs is costly and time-consuming.
   - Existing multimodal data generation methods rely on multiple specialist models and human feedback, which makes them difficult to scale and automate.
2. **Limitations of existing methods**:
   - Conventional pipelines combine specialist models (such as OCR and captioning) or stronger VLM APIs with expensive human annotation.
   - Some methods combine open-source large language models (LLMs) with different visual specialists to select high-quality image-text pairs, but still require human intervention to filter the noisy generated data.
3. **Importance of consistency checks**:
   - Recent research suggests that LLMs and VLMs should produce consistent results for semantically similar prompts. Consistency checks can therefore be used to filter noisy generated text and captions.

### Solutions proposed in the paper

The paper proposes a self-instructed multimodal data generation pipeline named World to Code (W2C), which addresses the problem in the following ways:

1. **Self-instructed data generation**:
   - Use an existing VLM to extract cross-modal information and generate data through different prompts.
   - Assess the quality of the generated output with a consistency filtering strategy, so that unreliable generations are discarded.
2. **Reduced dependence on specialist models**:
   - W2C reduces the dependence on multiple specialist models and lowers the cost of generating data.
   - The automated self-instructed process avoids expensive manual annotation.
3. **Code format organization**:
   - The generated data is organized into a Python code format, which makes it more structured and easier to interpret.
4. **Experimental verification**:
   - Experimental results show that W2C improves the performance of existing VLMs on multiple VQA and visual grounding benchmarks.
   - In particular, on benchmarks such as GQA and MME, W2C has achieved an accuracy improvement of more than 5% in few-shot evaluations.

### Summary

The main contribution of this paper is W2C, a self-instructed multimodal data generation pipeline that reduces the dependence on specialist models and manual annotation and improves the quality of the generated data through consistency filtering, thereby improving the performance of VLMs on visual question answering and visual grounding tasks.
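The consistency filtering idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: `ask` stands in for any VLM call that maps a prompt to a text answer, and the prompt templates, the `expected` answer, and the `min_agreement` threshold are all hypothetical.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def filter_entities(
    entities: Sequence[str],
    ask: Callable[[str], str],          # hypothetical VLM wrapper: prompt -> answer
    templates: Sequence[str],           # paraphrased prompts with an {entity} slot
    expected: str = "yes",
    min_agreement: float = 1.0,
) -> list[str]:
    """Keep an entity only if enough paraphrased prompts yield the expected answer."""
    if not templates:
        return []
    kept = []
    for entity in entities:
        answers = [normalize(ask(t.format(entity=entity))) for t in templates]
        agreement = sum(a == expected for a in answers) / len(answers)
        if agreement >= min_agreement:
            kept.append(entity)
    return kept


if __name__ == "__main__":
    # Stand-in for a real model: confident about "dog", inconsistent about "cat".
    def fake_vlm(prompt: str) -> str:
        if "dog" in prompt:
            return "Yes."
        return "yes" if prompt.startswith("Is there") else "no"

    prompts = [
        "Is there a {entity} in the image?",
        "Does the image contain a {entity}?",
    ]
    print(filter_entities(["dog", "cat"], fake_vlm, prompts))
```

Requiring full agreement (`min_agreement=1.0`) is the strictest choice; lowering the threshold trades precision for recall when the model answers paraphrased prompts less stably.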

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

VIGC: Visual Instruction Generation and Correction

Visual Commonsense-Aware Representation Network for Video Captioning

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Cap4Video++: Enhancing Video Understanding with Auxiliary Captions

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Enhanced Video Caption Generation Based on Multimodal Features

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

CogVLM: Visual Expert for Large Language Models

LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text

From Captions to Visual Concepts and Back