The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay

2023-06-02

Abstract: Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and language models with 1.3B and 7.5B parameters trained on it.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the source and quality of training data for large language models (LLMs). Specifically, it explores the following points:

1. **Data requirements for large-scale language models**: As models grow, their demand for training data grows rapidly. Existing curated datasets (such as The Pile) may not meet this demand because they are limited in size and costly to obtain.
2. **Data quality and diversity**: The traditional view holds that high-quality models require a mix of filtered web data and carefully curated "high-quality" corpora (such as social media conversations, books, or technical papers). It is doubtful whether this approach remains sustainable for models that need trillions of training tokens.
3. **Effectiveness of web data**: The paper challenges the traditional view by showing that properly filtered and deduplicated web data alone can produce powerful models, in some cases surpassing models trained on curated corpora.
4. **Importance of data deduplication**: The paper emphasizes that deduplication improves model performance, especially on large-scale datasets. Removing duplicates reduces memorization of repeated text and improves generalization.

By constructing a high-quality web dataset named **RefinedWeb**, the paper aims to demonstrate that web data alone is sufficient to train high-performing large language models, and it offers the dataset and the models trained on it as resources for future research.
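The filtering and deduplication steps described above can be illustrated with a small sketch. The summary does not say which filters or deduplication algorithm the paper uses, so everything below is an assumption chosen for illustration: a hypothetical minimum-length filter (`passes_filter`), exact deduplication by hashing normalized text, and near-duplicate removal by Jaccard similarity over word trigrams. The pairwise comparison is quadratic and only suitable for toy inputs; web-scale pipelines typically rely on approximate methods instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_filter(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality heuristic (an assumption): drop very short documents."""
    return len(text.split()) >= min_words


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each document that passes the filter
    and is neither an exact nor a near duplicate of a kept document."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for doc in docs:
        if not passes_filter(doc):
            continue
        # Exact deduplication on normalized text.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near-duplicate check against every kept document (O(n^2), toy scale only).
        sh = shingles(doc)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog near the river bank today",
        "the quick  brown fox jumps over the lazy dog near the river bank today",
        "The quick brown fox jumps over the lazy dog near the river bank tonight",
        "Too short",
        "Large language models are trained on trillions of tokens of web text",
    ]
    for doc in deduplicate(corpus):
        print(doc)
```

In this toy corpus the second document is an exact duplicate after normalization, the third differs from the first by one word and falls above the 0.8 similarity threshold, and the fourth is rejected by the length filter, so only the first and last documents are kept. The threshold and trigram size are illustrative values, not settings taken from the paper.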

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

SWEb: A Large Web Dataset for the Scandinavian Languages

Leveraging Web-Crawled Data for High-Quality Fine-Tuning

ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

RedPajama: an Open Dataset for Training Large Language Models

Does your data spark joy? Performance gains from domain upsampling at the end of training

Zyda: A 1.3T Dataset for Open Language Modeling

Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach

Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models

Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

DataComp-LM: In search of the next generation of training sets for language models

The Web Can Be Your Oyster for Improving Large Language Models

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data