Abstract: During the development of large language models (LLMs), pre-training data play a critical role in shaping their capabilities. In recent years, several large-scale, high-quality pre-training datasets have been released to accelerate LLM research, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, the focus has increasingly shifted to domain-specific capabilities and safety concerns, making these earlier coarse-grained texts insufficient for training requirements. Furthermore, fine-grained information such as quality, domain and toxicity is becoming increasingly important for building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale, high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicitly noisy texts from the raw content. Second, a quality evaluation model, a domain classifier, and a toxicity evaluation model are designed to assess the remaining cleaned data. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release ChineseWebText2.0, the largest high-quality, fine-grained Chinese text dataset, which comprises 3.8TB of data; each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, allowing LLM researchers to select data based on various types of fine-grained information. The data, code and tool-chain are available at https://github.com/CASIA-LM/ChineseWebText-2.0
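
To illustrate the kind of data selection the abstract describes, here is a minimal sketch in Python. It assumes each record is a dictionary carrying the four annotations named above; the field names (`quality_score`, `domain_labels`, `toxicity_label`, `toxicity_score`), the thresholds, and the sample records are hypothetical and are not taken from the released dataset or tool-chain.

```python
from typing import Iterable, Iterator, Optional


def select_texts(
    records: Iterable[dict],
    min_quality: float = 0.9,      # assumed threshold, not from the paper
    max_toxicity: float = 0.1,     # assumed threshold, not from the paper
    domains: Optional[Iterable[str]] = None,
) -> Iterator[dict]:
    """Yield records that pass the quality, toxicity and domain filters."""
    wanted = set(domains) if domains is not None else None
    for rec in records:
        # Drop texts whose quality score is below the threshold.
        if rec["quality_score"] < min_quality:
            continue
        # Drop texts whose toxicity score is above the threshold.
        if rec["toxicity_score"] > max_toxicity:
            continue
        # If domains were requested, keep only texts sharing at least one label.
        if wanted is not None and wanted.isdisjoint(rec["domain_labels"]):
            continue
        yield rec


# Hypothetical records using the assumed field names.
sample = [
    {"text": "a", "quality_score": 0.95, "domain_labels": ["medicine"],
     "toxicity_label": 0, "toxicity_score": 0.01},
    {"text": "b", "quality_score": 0.40, "domain_labels": ["medicine"],
     "toxicity_label": 0, "toxicity_score": 0.02},
    {"text": "c", "quality_score": 0.97, "domain_labels": ["law"],
     "toxicity_label": 0, "toxicity_score": 0.03},
    {"text": "d", "quality_score": 0.99, "domain_labels": ["medicine", "education"],
     "toxicity_label": 1, "toxicity_score": 0.80},
]

selected = [rec["text"] for rec in select_texts(sample, domains=["medicine"])]
```

With these sample records, only the first one passes all three filters when the `medicine` domain is requested. The actual schema and sensible thresholds should be checked against the repository linked in the abstract.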

ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
