Oasis: Data Curation and Assessment System for Pretraining of Large Language Models

Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Jun Zhao, Shengping Liu
2023-11-21
Abstract: Data is one of the most critical elements in building a large language model. However, existing systems either fail to customize a corpus curation pipeline or neglect to leverage comprehensive corpus assessment for iterative optimization of the curation. To this end, we present a pretraining corpus curation and assessment platform called Oasis -- a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module can devise customized rules according to explicit feedback. The debiased neural filter module builds the quality classification dataset in a negative-centric manner to remove the undesired bias. The adaptive document deduplication module could execute large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. And in the holistic data assessment module, a corpus can be assessed in local and global views, with three evaluation means including human, GPT-4, and heuristic metrics. We exhibit a complete process to use Oasis for the curation and assessment of pretraining data. In addition, an 800GB bilingual corpus curated by Oasis is publicly released.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to efficiently curate and evaluate data when building large language models (LLMs). Existing systems either cannot customize the data curation pipeline or neglect comprehensive corpus assessment, which hampers iterative optimization of the curation. To this end, the authors propose Oasis, a one-stop pretraining corpus curation and assessment platform that aims to improve and quantify data quality through user-friendly interactive interfaces.

### Main Issues:
1. **Data Curation**:
   - Existing data curation methods lack customization: different data sources require different curation pipelines, but current systems cannot adapt flexibly.
   - An open-source, customizable pretraining data curation system is lacking.
2. **Data Evaluation**:
   - Existing data evaluation methods mostly rely on the final model's performance, which is resource-intensive and inefficient.
   - A comprehensive, multi-dimensional, and easy-to-use evaluation system that assesses the quality of a pretraining corpus from multiple angles is lacking.

### Solution:
- **Oasis Platform**:
  - **Customized Data Curation Module**:
    - **Interactive Modular Rule Filter**: Lets users build customized rule sets according to explicit feedback.
    - **Debiased Neural Filter**: Builds the quality classification dataset in a negative-centric manner to remove undesired bias.
    - **Adaptive Document Deduplication**: Performs large-scale deduplication with limited memory resources.
  - **Holistic Data Assessment Module**:
    - **Local Quality Evaluation**: Evaluates aspects such as sentence fluency and document coherence, supporting human evaluation, GPT-4 evaluation, and heuristic metrics.
    - **Global Distribution Evaluation**: Assesses the diversity and richness of the corpus through various heuristic metrics.
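To make the curation steps above more concrete, here is a minimal Python sketch of a rule filter built from composable rules, followed by a deduplication pass. It is not the Oasis implementation: the rule names (`min_words`, `max_symbol_ratio`), the thresholds, and the helper functions are all hypothetical, and the exact-hash deduplication is a deliberately simple stand-in for the adaptive document deduplication the paper describes.

```python
import hashlib
import re
from typing import Callable, Iterable, List

# A rule maps a document to True (keep) or False (drop).
# All rule names and thresholds here are illustrative assumptions.
Rule = Callable[[str], bool]


def min_words(n: int) -> Rule:
    """Keep documents with at least n whitespace-separated tokens."""
    return lambda doc: len(doc.split()) >= n


def max_symbol_ratio(limit: float) -> Rule:
    """Keep documents whose share of non-alphanumeric, non-space
    characters does not exceed limit."""
    def rule(doc: str) -> bool:
        if not doc:
            return False
        symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
        return symbols / len(doc) <= limit
    return rule


def apply_rules(docs: Iterable[str], rules: List[Rule]) -> List[str]:
    """Keep only the documents that pass every rule in the list."""
    return [doc for doc in docs if all(rule(doc) for rule in rules)]


def normalize(doc: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash alike."""
    return re.sub(r"\s+", " ", doc).strip().lower()


def dedup_exact(docs: Iterable[str]) -> List[str]:
    """Drop documents whose normalized text was already seen.

    Stand-in only: storing one fixed-size digest per document bounds
    memory per entry, but this does not catch near-duplicates.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Because each rule is an independent function, a rule can be added, removed, or re-tuned after inspecting which documents it drops, which is the kind of feedback loop the interactive rule filter is meant to support.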
### Experiments and Applications:
- The authors demonstrate how to use Oasis to curate and evaluate Common Crawl data, producing an 800GB bilingual (Chinese and English) corpus that is publicly released.
- Comparative experiments are reported to show that Oasis improves data quality and diversity.

### Conclusion:
With Oasis, the paper addresses key issues in the curation and evaluation of pretraining data for large language models, offering a comprehensive, customizable, and efficient solution.