Abstract: During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years, several large-scale and high-quality pre-training datasets have been released to accelerate LLM research, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those earlier coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicitly noisy texts from the raw contents. Second, a quality evaluation model, a domain classifier, and a toxicity evaluation model are designed to assess the remaining cleaned data. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release ChineseWebText2.0, the largest high-quality and fine-grained Chinese text dataset, which consists of 3.8TB of data. Each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, enabling LLM researchers to select data based on various types of fine-grained information. The data, code and tool-chain are available at https://github.com/CASIA-LM/ChineseWebText-2.0
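
As an illustration of how such per-text annotations might be used for data selection, the following is a minimal sketch rather than the released tool-chain. It assumes the data are stored as JSON Lines and that each record carries the hypothetical fields `text`, `quality_score`, `domain_labels`, `toxicity_label` and `toxicity_score`; the actual field names, label vocabulary and score ranges in ChineseWebText2.0 may differ, and the thresholds are arbitrary.

```python
import json
from typing import Iterable, Iterator


def select_records(
    lines: Iterable[str],
    min_quality: float = 0.9,
    wanted_domains: frozenset = frozenset(),
    max_toxicity: float = 0.1,
) -> Iterator[dict]:
    """Yield records that pass quality, toxicity and domain filters.

    All field names below are assumptions made for this sketch,
    not the documented schema of the released dataset.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSON Lines input
        record = json.loads(line)
        if record["quality_score"] < min_quality:
            continue  # quality too low
        if record["toxicity_score"] > max_toxicity:
            continue  # toxicity too high
        # An empty wanted_domains set means "accept every domain".
        if wanted_domains and not wanted_domains.intersection(record["domain_labels"]):
            continue
        yield record


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real data file.
    sample = [
        json.dumps({"text": "a", "quality_score": 0.95, "domain_labels": ["medicine"],
                    "toxicity_label": 0, "toxicity_score": 0.01}),
        json.dumps({"text": "b", "quality_score": 0.50, "domain_labels": ["medicine"],
                    "toxicity_label": 0, "toxicity_score": 0.01}),
        json.dumps({"text": "c", "quality_score": 0.97, "domain_labels": ["law"],
                    "toxicity_label": 1, "toxicity_score": 0.80}),
    ]
    kept = select_records(sample, wanted_domains=frozenset({"medicine"}))
    for rec in kept:
        print(rec["text"])
```

Filtering lazily with a generator keeps memory use flat, which matters for a corpus of this size; in practice `lines` would be an open file handle rather than an in-memory list.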
