Abstract: The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.

What problem does this paper attempt to address?

The paper addresses the lack of transparency around, and the limited open availability of, the training data behind current large language models (LLMs) and multimodal large language models (MLLMs). Although these models have made significant progress in performance, the details of their training data are often kept confidential, which hinders further development within the community. In response, the paper introduces the "WanJuan" dataset, a large-scale multimodal dataset containing both Chinese and English data collected from various web sources, with the aim of providing high-quality, diverse data for training large models.

### Main Contributions:

1. **Construction of a Large-Scale Multimodal Dataset**:
   - **Text Data**: Includes over 600 million documents, with a data storage volume exceeding 1TB.
   - **Image-Text Data**: Processed into document form, totaling over 22 million documents, with a data volume exceeding 200GB (excluding images).
   - **Video Data**: Totals over 1,000 videos, with a data volume exceeding 900GB.
2. **Ensuring Data Security and High Quality**:
   - Algorithmic processing and manual verification filter out content such as pornography, violence, and bias to ensure data security and value consistency.
3. **Providing Unified JSON Format Processing Tools and Support Documentation**:
   - Helps users apply the data to large-model training quickly.

### Dataset Statistics:

- **Text Data**: Sourced from webpages, encyclopedias, books, patents, textbooks, and exam questions, covering multiple fields such as technology, literature, media, education, and law.
- **Image-Text Data**: Mainly sourced from official media news and user-generated articles, covering various fields such as news events, people, natural landscapes, and social life.
- **Video Data**: Sourced from high-quality program clips from China Media Group and Shanghai Media Group, covering fields such as military, art, sports, nature, knowledge, film art, media, food, history, science, and education.

### Methods:

- **Text Data Cleaning**: Extracting text from raw WARC files, classifying it with language detection tools, filtering out invalid data, removing low-quality content, and performing deduplication.
- **Image-Text Data Cleaning**: Extracting the required content from official sources, removing invalid content such as ads, lists, navigation bars, emojis, and comments, and retaining meaningful image and text paragraphs.
- **Video Data**: The cleaning work mainly focuses on the text and image-text data to ensure data quality and security.

Through these methods, the paper provides high-quality, diverse data resources that contribute to advancing research in natural language processing and computer vision, especially in tasks requiring cross-modal understanding and content generation.
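The text-cleaning steps listed under Methods (language classification, filtering of invalid data, deduplication) can be illustrated with a minimal Python sketch. This is not the paper's pipeline. The function names, the `min_chars` threshold, the output keys `lang` and `content`, the character-ratio language heuristic, and the exact-match SHA-256 deduplication are all assumptions made for illustration. A real pipeline would first extract text from WARC files and would likely use a trained language detector and near-duplicate detection.

```python
import hashlib
import re

# Hypothetical stand-in for a real language detector: it only looks at
# the share of CJK ideographs versus ASCII letters in the text.
CJK = re.compile(r"[\u4e00-\u9fff]")
ASCII_LETTER = re.compile(r"[A-Za-z]")


def detect_language(text: str) -> str:
    """Return 'zh', 'en', or 'unknown' from a crude character-ratio heuristic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk_share = sum(1 for ch in letters if CJK.match(ch)) / len(letters)
    ascii_share = sum(1 for ch in letters if ASCII_LETTER.match(ch)) / len(letters)
    if cjk_share > 0.5:
        return "zh"
    if ascii_share > 0.5:
        return "en"
    return "unknown"


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs, min_chars=20, keep_langs=("zh", "en")):
    """Filter short or off-language documents, then drop exact duplicates.

    min_chars and keep_langs are illustrative defaults, not values from the paper.
    """
    seen = set()
    kept = []
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:          # invalid / too-short content
            continue
        lang = detect_language(text)
        if lang not in keep_langs:         # language classification
            continue
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:                 # exact deduplication after normalization
            continue
        seen.add(digest)
        kept.append({"lang": lang, "content": text})
    return kept
```

Calling `clean_corpus` on a list of raw strings returns a list of dictionaries, one per surviving document. Hashing the normalized text keeps memory use proportional to the number of unique documents rather than to their total size, at the cost of missing near-duplicates that differ by more than case or whitespace.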

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
