ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

Jianghao Chen, Pu Jian, Tengxiao Xi, Dongyi Yi, Qianlong Du, Chenglin Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, Jiajun Zhang
2023-11-10
Abstract: During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate LLM research, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information about the corpora, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain, EvalWeb, to extract clean Chinese texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, we can easily utilize an appropriate threshold to select the high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text dataset, ChineseWebText, which consists of 1.42 TB of data in which each text is associated with a quality score, allowing LLM researchers to choose data according to their desired quality thresholds. We also release a much cleaner subset of 600 GB of Chinese data with quality exceeding 90%.
Computation and Language
What problem does this paper attempt to address?
In the development of large language models (LLMs), the scale and quality of pre-training data play a crucial role in shaping the model's capabilities. Although several large datasets such as C4, Pile, RefinedWeb, and WanJuan have been publicly released, they focus mainly on English, and no complete toolchain exists for extracting clean Chinese text from web data. Moreover, these datasets typically do not provide fine-grained information, such as a quality score for each piece of text, which limits researchers' ability to re-filter the data against a desired quality threshold.

To address these challenges, the paper proposes a new complete toolchain, EvalWeb, for extracting high-quality Chinese text from noisy web data. The toolchain consists of the following steps:

1. **Preliminary Filtering**: Use manually designed rules to remove obviously noisy text, producing an initial set of clean Chinese data.
2. **Quality Assessment**: Use a BERT-based quality assessment model to assign a specific quality score to each of the remaining, relatively clean texts.
3. **High-Quality Data Selection**: Select high-quality pre-training data by setting an appropriate threshold on the quality score.

With this method, the paper releases ChineseWebText, which the authors describe as the largest high-quality Chinese web text dataset to date. It totals 1.42 TB, and each piece of text is accompanied by a quality score. A cleaner 600 GB subset of Chinese data with quality scores exceeding 90% is also released. The data, code, and toolchain are available on the relevant websites.
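The three steps described above can be sketched roughly as follows. This is a minimal illustration, not the EvalWeb implementation: the rule thresholds (`min_chars`, `min_chinese_ratio`), the `ScoredText` record, and all function names are hypothetical, the quality model is represented by an arbitrary `scorer` callable rather than the paper's BERT-based model, and quality scores are assumed to lie in [0, 1] so that "exceeding 90%" corresponds to `score > 0.9`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class ScoredText:
    """A text paired with its quality score (assumed to lie in [0, 1])."""
    text: str
    score: float


def passes_rules(text: str, min_chars: int = 20,
                 min_chinese_ratio: float = 0.3) -> bool:
    """Hypothetical stand-in for the manually crafted filtering rules.

    Rejects very short texts and texts with few CJK Unified Ideographs.
    The real rule set and its thresholds are not specified here.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    chinese = sum(1 for ch in stripped if "\u4e00" <= ch <= "\u9fff")
    return chinese / len(stripped) >= min_chinese_ratio


def score_texts(texts: Iterable[str],
                scorer: Callable[[str], float]) -> Iterator[ScoredText]:
    """Steps 1 and 2: drop rule-rejected texts, then score the rest.

    `scorer` is a placeholder for a quality model; any callable that
    maps a text to a float works for this sketch.
    """
    for text in texts:
        if passes_rules(text):
            yield ScoredText(text=text, score=scorer(text))


def select_high_quality(scored: Iterable[ScoredText],
                        threshold: float = 0.9) -> List[ScoredText]:
    """Step 3: keep only texts whose score exceeds the threshold."""
    return [item for item in scored if item.score > threshold]
```

Because every surviving text keeps its score, the same scored collection can be re-filtered with a different `threshold` without re-running the model, which is the kind of re-filtering that per-text quality scores are meant to enable.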