Abstract: Data is one of the most critical elements in building a large language model. However, existing systems either fail to customize a corpus curation pipeline or neglect to leverage comprehensive corpus assessment for iterative optimization of the curation. To this end, we present a pretraining corpus curation and assessment platform called Oasis -- a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module can devise customized rules according to explicit feedback. The debiased neural filter module builds the quality classification dataset in a negative-centric manner to remove the undesired bias. The adaptive document deduplication module could execute large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. And in the holistic data assessment module, a corpus can be assessed in local and global views, with three evaluation means including human, GPT-4, and heuristic metrics. We exhibit a complete process to use Oasis for the curation and assessment of pretraining data. In addition, an 800GB bilingual corpus curated by Oasis is publicly released.

What problem does this paper attempt to address?

The paper addresses the problem of how to efficiently curate and evaluate data when building large language models (LLMs). Existing systems either cannot customize the data curation pipeline or neglect comprehensive corpus assessment, so the curation process cannot be optimized iteratively. To this end, the authors propose Oasis, a one-stop pretraining corpus curation and assessment platform that aims to improve and quantify data quality through a user-friendly interactive interface.

### Main Issues:

1. **Data Curation**:
   - Existing data curation methods lack customization; different data sources require different curation processes, but current systems cannot adapt flexibly.
   - There is no open-source, customizable pretraining data curation system.
2. **Data Evaluation**:
   - Existing data evaluation methods mostly rely on the final model's performance, which is resource-intensive and inefficient.
   - There is no comprehensive, multi-dimensional, and easy-to-use evaluation system that can assess the quality of a pretraining corpus from multiple angles.

### Solution:

- **Oasis Platform**:
  - **Customized Data Curation Module**:
    - **Interactive Modular Rule Filter**: Allows users to build custom rule sets based on explicit feedback.
    - **Debiased Neural Filter**: Builds the quality classification dataset in a negative-centric manner to remove undesired bias.
    - **Adaptive Document Deduplication Module**: Performs large-scale deduplication with limited memory resources.
  - **Holistic Data Assessment Module**:
    - **Local Quality Evaluation**: Evaluates aspects such as sentence fluency and document coherence, supporting human evaluation, GPT-4 evaluation, and heuristic metrics.
    - **Global Distribution Evaluation**: Assesses the diversity and richness of the corpus through various heuristic metrics.
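
To make the curation and assessment steps more concrete, here is a minimal Python sketch of three of the ideas listed above: composable filter rules, memory-conscious deduplication, and a heuristic diversity metric. It is not the Oasis implementation. All function names, thresholds, and the choice of exact hash-based deduplication and type-token ratio are illustrative assumptions; the summary does not specify how Oasis implements its rule filter, its adaptive deduplication, or its heuristic metrics.

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator, List

# A rule returns True when a document should be kept (hypothetical interface).
Rule = Callable[[str], bool]


def min_length(n_chars: int) -> Rule:
    """Keep documents with at least n_chars characters."""
    return lambda doc: len(doc) >= n_chars


def max_symbol_ratio(limit: float) -> Rule:
    """Keep documents whose share of non-alphanumeric, non-space characters is <= limit."""
    def rule(doc: str) -> bool:
        if not doc:
            return False
        symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
        return symbols / len(doc) <= limit
    return rule


def apply_rules(docs: Iterable[str], rules: List[Rule]) -> Iterator[str]:
    """Yield only the documents that pass every rule in the (user-editable) rule list."""
    for doc in docs:
        if all(rule(doc) for rule in rules):
            yield doc


def normalize(doc: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", doc).strip().lower()


def dedup_exact(docs: Iterable[str]) -> Iterator[str]:
    """Exact deduplication after normalization.

    Only a truncated 8-byte digest per document is kept in memory instead of
    the full text. This is a stand-in for, not a description of, Oasis's
    adaptive deduplication; truncated digests admit rare false positives.
    """
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()[:8]
        if digest not in seen:
            seen.add(digest)
            yield doc


def type_token_ratio(docs: Iterable[str]) -> float:
    """One simple global diversity heuristic: unique tokens / total tokens."""
    tokens = [tok for doc in docs for tok in normalize(doc).split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    corpus = [
        "Hello world, this is a clean document.",
        "hello   WORLD, this is a clean document.",  # duplicate after normalization
        "$$$ ### !!!",                               # mostly symbols
        "short",                                     # too short
    ]
    rules = [min_length(10), max_symbol_ratio(0.3)]
    curated = list(dedup_exact(apply_rules(corpus, rules)))
    print(curated)
    print(type_token_ratio(curated))
```

Because each rule is an independent function, a rule can be added, removed, or re-thresholded after inspecting filtered samples, which is the kind of feedback loop the interactive rule filter is described as supporting.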
### Experiments and Applications:

- The authors demonstrate how to use the Oasis platform to curate and evaluate Common Crawl data, ultimately generating an 800GB bilingual corpus in Chinese and English, which is publicly released.
- Comparative experiments prove the effectiveness of the Oasis platform in improving data quality and diversity.

### Conclusion:

By proposing the Oasis platform, this paper addresses key issues in the curation and evaluation of pre-training data for large-scale language models, providing a comprehensive, customizable, and efficient solution.

Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
