OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Zhongying Tu, Tong Lu, Yali Wang, Limin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He, Jifeng Dai

2024-07-12

Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
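
The abstract's third point, that an interleaved document can be degraded to a pure text corpus or to image-text pairs, can be illustrated with a minimal Python sketch. The document layout used here (two parallel lists, `images` and `texts`, where each position holds either an image reference or a text segment) is an assumption made for illustration and is not confirmed as the released OmniCorpus schema. The pairing rule (each image takes the next text segment, falling back to the previous one) is likewise a hypothetical choice, not the authors' method.

```python
from typing import List, Optional, Tuple

# Hypothetical interleaved document: two parallel lists of equal length.
# At each position exactly one of images[i] / texts[i] is set; the other is None.
# This layout is an assumption for illustration, not the confirmed OmniCorpus schema.


def to_pure_text(texts: List[Optional[str]]) -> str:
    """Drop image positions and join the remaining text segments in order."""
    return "\n".join(t for t in texts if t is not None)


def to_image_text_pairs(
    images: List[Optional[str]], texts: List[Optional[str]]
) -> List[Tuple[str, str]]:
    """Pair each image with the nearest following text segment.

    If no text follows the image, fall back to the nearest preceding segment.
    Images in documents with no text at all are skipped.
    """
    pairs = []
    for i, img in enumerate(images):
        if img is None:
            continue
        caption = next((t for t in texts[i + 1:] if t is not None), None)
        if caption is None:
            caption = next((t for t in reversed(texts[:i]) if t is not None), None)
        if caption is not None:
            pairs.append((img, caption))
    return pairs


# Example document with placeholder image names and text.
images = ["a.jpg", None, None, "b.jpg"]
texts = [None, "A cat.", "More text.", None]

pure_text = to_pure_text(texts)
pairs = to_image_text_pairs(images, texts)
```

Because both functions only read the interleaved lists, the same stored document can serve all three uses (interleaved, text-only, paired) without duplicating data.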

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the limited scale and diversity of existing image-text interleaved datasets, which impede the development of multimodal large language models (MLLMs). Specifically:

1. **Limited data scale**: Existing image-text interleaved datasets are small. The largest contains only about 140 million documents, far fewer than existing pure-text or image-text pair datasets.

2. **Single data source**: Most existing image-text interleaved datasets are sourced mainly from English-language websites in Common Crawl (CC), which limits the diversity of their content.

3. **Low data quality**: Existing datasets may lose the structure of the original document during processing, dropping contextual details and reducing text quality and contextual richness.

To address these problems, the paper proposes OmniCorpus, a large-scale image-text interleaved dataset containing 8.6 billion images and 1,696 billion text tokens. By combining diverse data sources (including non-English websites and video platforms), an efficient processing pipeline, and high-quality data filtering, OmniCorpus aims to provide a solid data foundation for future multimodal model research.

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

OmniBench: Towards The Future of Universal Omni-Language Models

15M Multimodal Facial Image-Text Dataset

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

LAION-5B: An open large-scale dataset for training next generation image-text models

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation

On the Hidden Mystery of OCR in Large Multimodal Models

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

DataComp: In search of the next generation of multimodal datasets

What's In My Big Data?

OmniCity: Omnipotent City Understanding with Multi-level and Multi-view Images

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding