ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

Jianghao Chen, Pu Jian, Tengxiao Xi, Dongyi Yi, Qianlong Du, Chenglin Ding, Guibo Zhu, Chengqing Zong, Jinqiao Wang, Jiajun Zhang

2023-11-10

Abstract: During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate LLM research, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information about the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain, EvalWeb, to extract clean Chinese texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web content. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, an appropriate threshold can be applied to select high-quality pre-training data for Chinese. Using our proposed approach, we release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset, which consists of 1.42 TB of data. Each text is associated with a quality score, allowing LLM researchers to choose data according to their desired quality thresholds. We also release a much cleaner subset of 600 GB of Chinese data with quality exceeding 90%.

Computation and Language

What problem does this paper attempt to address?

In the development of large language models (LLMs), the scale and quality of pre-training data play a crucial role in shaping the model's capabilities. Several large datasets such as C4, Pile, RefinedWeb, and WanJuan have been publicly released, but they focus mainly on English, and there is still no complete toolchain for extracting clean Chinese text from web data. Moreover, these datasets typically do not provide fine-grained information, such as a quality score for each piece of text, which limits researchers' ability to re-filter the data against a desired quality threshold.

To address these challenges, the paper proposes a new complete toolchain, EvalWeb, for extracting high-quality Chinese text from noisy web data. The toolchain has three steps:

1. **Preliminary Filtering**: Use manually designed rules to remove obviously noisy text, producing initial clean Chinese data.
2. **Quality Assessment**: Use a BERT-based quality assessment model to assign a specific quality score to each remaining, relatively clean text.
3. **High-Quality Data Selection**: Select high-quality pre-training data by setting an appropriate threshold on the quality score.

With this method, the paper releases ChineseWebText, described as the largest high-quality Chinese web text dataset to date, with a total volume of 1.42 TB and a quality score attached to each piece of text. A cleaner 600 GB subset of Chinese data with quality exceeding 90% is also released. The data, code, and toolchain are publicly available.
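The three steps above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions, not the EvalWeb implementation: the rule thresholds (`min_chars`, `min_cjk_ratio`), the function names, and especially `toy_quality_score` are hypothetical. The paper scores texts with a BERT-based model, for which a trivial character-diversity heuristic stands in here so the example runs without any model weights.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Matches characters in the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


@dataclass
class ScoredText:
    text: str
    score: float  # higher means "cleaner" under the chosen scorer


def passes_rules(text: str, min_chars: int = 10, min_cjk_ratio: float = 0.3) -> bool:
    """Step 1 (hypothetical rules): drop very short or mostly non-Chinese text."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    cjk_count = len(CJK_CHAR.findall(stripped))
    return cjk_count / len(stripped) >= min_cjk_ratio


def toy_quality_score(text: str) -> float:
    """Step 2 placeholder: fraction of distinct characters in the text.

    Stand-in for a learned quality model; NOT the paper's BERT-based scorer.
    """
    stripped = text.strip()
    return len(set(stripped)) / len(stripped)


def run_pipeline(
    texts: Iterable[str],
    scorer: Callable[[str], float] = toy_quality_score,
) -> List[ScoredText]:
    """Apply the rule filter, then attach a quality score to each survivor."""
    return [ScoredText(t, scorer(t)) for t in texts if passes_rules(t)]


def select_by_threshold(scored: Iterable[ScoredText], threshold: float) -> List[ScoredText]:
    """Step 3: keep only texts whose score meets the chosen threshold."""
    return [s for s in scored if s.score >= threshold]


if __name__ == "__main__":
    docs = [
        "hello world, this is English only",  # fails the CJK-ratio rule
        "短",                                  # fails the length rule
        "大规模高质量的中文网页文本数据集",
        "哈哈哈哈哈哈哈哈哈哈哈哈",            # passes rules, low diversity
    ]
    scored = run_pipeline(docs)
    for item in select_by_threshold(scored, threshold=0.9):
        print(f"{item.score:.3f}\t{item.text}")
```

Because every surviving text keeps its score, the threshold in the last step can be changed later without re-running the filter or the scorer, which is the re-filtering flexibility the summary attributes to per-text quality scores. Scores from the toy heuristic are not comparable to the paper's quality scores.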

ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

WYWEB: A NLP Evaluation Benchmark For Classical Chinese

Leveraging Web-Crawled Data for High-Quality Fine-Tuning

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

The Web Can Be Your Oyster for Improving Large Language Models

Building a Large Japanese Web Corpus for Large Language Models

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models

LongWanjuan: Towards Systematic Measurement for Long Text Quality

Cleaner Pretraining Corpus Curation with Neural Web Scraping

CLEVA: Chinese Language Models EVAluation Platform

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study

WebCPM: Interactive Web Search for Chinese Long-form Question Answering.

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data