World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao
2024-09-30
Abstract: Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous studies on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture of specialists in caption and OCR, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. The pipeline leverages the VLM itself to extract cross-modal information via different prompts and filter the generated outputs again via a consistency filtering strategy. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also demonstrates that the new code parsing ability of VLMs presents better cross-modal equivalence than the commonly used detail caption ability. Our code is available at https://github.com/foundation-multimodal-models/World2Code.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how to generate high-quality multimodal data in a self-instructed way, so as to reduce the dependence on mixtures of specialist models and expensive manual annotation, and to improve the performance of vision-language models (VLMs) on visual question answering (VQA) and visual grounding tasks.

### Specific problem description

1. **Challenges in multimodal data generation**:
   - Annotating high-quality image-text pairs is costly and time-consuming.
   - Existing multimodal data generation methods rely on multiple specialist models and human feedback, which makes them difficult to scale and automate.
2. **Limitations of existing methods**:
   - Conventional pipelines combine specialist models (such as OCR and captioning) or call stronger VLM APIs, and they still depend on expensive human annotation.
   - Some methods combine open-source large language models (LLMs) with different visual experts to select high-quality image-text pairs, but still require human intervention to filter the noisy generated data.
3. **Importance of consistency checks**:
   - Recent research shows that LLMs and VLMs should produce consistent results for semantically similar prompts. Consistency checks can therefore be used to filter noisy generated text and captions.

### Solutions proposed in the paper

The paper proposes a self-instructed multimodal data generation pipeline named World to Code (W2C), which addresses the problem in the following ways:

1. **Self-instructed data generation**:
   - Use the VLM itself to extract cross-modal information and generate data through different prompts.
   - Assess the generated output with a consistency filtering strategy, so that only reliable data is kept.
2. **Reduced dependence on specialist models**:
   - W2C reduces the dependence on multiple specialist models and lowers the cost of generating data.
   - The automated self-instructed process avoids expensive manual annotation.
3. **Code format organization**:
   - The generated data is organized into a Python code format, which makes it more structured and easier to interpret.
4. **Experimental verification**:
   - Experimental results show that W2C significantly improves the performance of existing VLMs on multiple VQA and visual grounding benchmarks.
   - In particular, on benchmarks such as GQA and MME, W2C achieves an accuracy improvement of more than 5% in few-shot evaluations.

### Summary

The main contribution of this paper is W2C, a self-instructed multimodal data generation pipeline that reduces the dependence on specialist models and manual annotation and improves the quality of the generated data through consistency filtering, thereby significantly improving the performance of VLMs on various visual question answering and visual grounding tasks.
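
As a rough illustration of the two ideas in this summary, consistency filtering and organizing the output as Python code, the sketch below keeps only the image elements for which a VLM gives the same answer to two paraphrased prompts, then renders the survivors as Python source text. Everything named here is an assumption made for illustration and is not taken from the paper or its repository: the `Candidate` schema, the `ask_vlm` callable (a stand-in for any real model call), the prompt wording, the `min_agreement` threshold, and the layout of the generated `ImageSample` class.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One element the VLM proposed for an image (hypothetical schema)."""
    name: str          # e.g. "dog"
    description: str   # short caption for this element


# Two semantically similar prompts; the wording is an assumption.
PROMPTS = [
    "Is there a {name} in the image? Answer yes or no.",
    "Does the image contain a {name}? Answer yes or no.",
]


def consistency_filter(
    candidates: List[Candidate],
    ask_vlm: Callable[[str], str],
    min_agreement: float = 1.0,
) -> List[Candidate]:
    """Keep candidates that the VLM confirms across paraphrased prompts."""
    kept = []
    for cand in candidates:
        answers = [
            ask_vlm(prompt.format(name=cand.name)).strip().lower()
            for prompt in PROMPTS
        ]
        # Fraction of prompts answered with "yes".
        agreement = sum(a.startswith("yes") for a in answers) / len(PROMPTS)
        if agreement >= min_agreement:
            kept.append(cand)
    return kept


def to_python_code(image_id: str, caption: str, kept: List[Candidate]) -> str:
    """Render the filtered elements as Python source text (illustrative layout)."""
    lines = [
        "class ImageSample:",
        f"    image_id = {image_id!r}",
        f"    caption = {caption!r}",
        "    elements = {",
    ]
    for cand in kept:
        # repr() keeps the emitted string literals valid Python.
        lines.append(f"        {cand.name!r}: {cand.description!r},")
    lines.append("    }")
    return "\n".join(lines)
```

In use, `ask_vlm` would wrap whichever VLM is generating the data, so the same model both proposes and verifies the elements. Setting `min_agreement` below 1.0 would tolerate some disagreement between prompts; the strict default is simply the most conservative choice for this sketch.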