WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, Dahua Lin
2023-09-15
Abstract: The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further developments within the community. As a response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the lack of transparency and open availability of the training data behind current large language models (LLMs) and multimodal large language models (MLLMs). Although these models have made significant progress in performance, the details of their training data are often kept confidential, which hinders further development within the community. In response, the paper introduces the "WanJuan" dataset, a large-scale multimodal dataset containing both Chinese and English data, collected from various web sources and intended to provide high-quality, diverse data for training large models.

### Main Contributions:

1. **Construction of a Large-Scale Multimodal Dataset**:
   - **Text Data**: Includes over 600 million documents, with a data storage volume exceeding 1TB.
   - **Image-Text Data**: Processed into document form, totaling over 22 million documents, with a data volume exceeding 200GB (excluding images).
   - **Video Data**: Totals over 1000 videos, with a data volume exceeding 900GB.
2. **Ensuring Data Security and High Quality**:
   - Through algorithmic processing and manual verification, content such as pornography, violence, and bias is filtered out to ensure data security and value consistency.
3. **Providing Unified JSON Format Processing Tools and Support Documentation**:
   - Helps users quickly apply the data to large-model training.

### Dataset Statistics:

- **Text Data**: Sourced from webpages, encyclopedias, books, patents, textbooks, and exam questions, covering multiple fields such as technology, literature, media, education, and law.
- **Image-Text Data**: Mainly sourced from official media news and user-generated articles, covering various fields such as news events, people, natural landscapes, and social life.
- **Video Data**: Sourced from high-quality program clips from China Media Group and Shanghai Media Group, covering fields such as military, art, sports, nature, knowledge, film art, media, food, history, science, and education.

### Methods:

- **Text Data Cleaning**: Extracts text from raw WARC files, classifies it with language detection tools, filters out invalid data, removes low-quality content, and performs deduplication.
- **Image-Text Data Cleaning**: Extracts the required content from official sources, removes invalid content such as ads, lists, navigation bars, emojis, and comments, and retains meaningful image and text paragraphs.
- **Video Data**: Mainly focuses on cleaning the text and image-text data to ensure data quality and security.

Through these methods, the paper provides high-quality, diverse data resources that contribute to advancing research in natural language processing and computer vision, especially in tasks requiring cross-modal understanding and content generation.
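The text-cleaning steps listed under Methods (language classification, filtering of invalid and low-quality content, deduplication) can be illustrated with a minimal Python sketch. It is not the paper's pipeline: the character-ratio language detector, the length and letter-ratio thresholds, the hash-based exact deduplication, and the output field names `lang` and `content` are all assumptions made for illustration, and WARC extraction is omitted because it would need a third-party parser.

```python
import hashlib
import re
import unicodedata


def detect_language(text):
    """Toy detector (assumption): 'zh' if CJK letters dominate, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return "zh" if cjk / len(letters) > 0.5 else "en"


def is_low_quality(text, min_chars=20, min_alpha_ratio=0.6):
    """Hypothetical rule: too short, or mostly non-letter characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    alpha = sum(1 for ch in stripped if ch.isalpha())
    return alpha / len(stripped) < min_alpha_ratio


def fingerprint(text):
    """SHA-256 of the normalized text, used for exact deduplication."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(documents):
    """Drop low-quality text, tag language, and remove exact duplicates."""
    seen = set()
    kept = []
    for text in documents:
        if is_low_quality(text):
            continue
        lang = detect_language(text)
        if lang == "unknown":
            continue
        key = fingerprint(text)
        if key in seen:
            continue  # an identical normalized document was already kept
        seen.add(key)
        kept.append({"lang": lang, "content": text.strip()})
    return kept
```

Calling `clean_corpus` on a list of raw strings returns one dictionary per surviving document. A production pipeline would likely swap in a trained language identifier and near-duplicate detection (for example MinHash) in place of these simple stand-ins.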