Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data-refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, 14.6% over Llama-2-7B, and 20.3% over CodeLlama-7B, all within 10B training tokens, which makes them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with a >100B corpus and models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
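To make the idea of "data refinement as a programming task" concrete, here is a minimal Python sketch of executing a small per-example program over one document. It is an illustration under stated assumptions, not the ProX implementation: the operation names (`Normalize`, `RemoveLines`, `DropDoc`) and the `execute` helper are hypothetical, and the program is hard-coded where the framework described in the abstract would have a small language model generate it.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


# All operation names below are hypothetical, chosen only for illustration.
@dataclass
class Normalize:
    """Replace every occurrence of `source` with `target` (string normalization)."""
    source: str
    target: str


@dataclass
class RemoveLines:
    """Delete lines start..end (0-indexed, inclusive) from the document."""
    start: int
    end: int


@dataclass
class DropDoc:
    """Discard the whole document."""


Operation = Union[Normalize, RemoveLines, DropDoc]


def execute(program: List[Operation], text: str) -> Optional[str]:
    """Apply a per-example program to `text`; return None if the document is dropped."""
    if any(isinstance(op, DropDoc) for op in program):
        return None

    lines = text.split("\n")
    # Collect every line index to delete first, so that all indices
    # refer to the original, unmodified document.
    doomed = set()
    for op in program:
        if isinstance(op, RemoveLines):
            doomed.update(range(op.start, op.end + 1))
    result = "\n".join(line for i, line in enumerate(lines) if i not in doomed)

    # Apply string normalizations in program order on the remaining text.
    for op in program:
        if isinstance(op, Normalize):
            result = result.replace(op.source, op.target)
    return result


# Hard-coded program standing in for model output (an assumption of this sketch).
doc = "Home | Login | Share\nProX   refines data.\nIt works per example."
program = [RemoveLines(0, 0), Normalize("   ", " ")]
refined = execute(program, doc)
print(refined)
```

Representing operations as plain data objects rather than executing generated source text directly keeps the executor small and avoids running arbitrary code; a real system could equally parse function-call strings into such objects.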

Leveraging Web-Crawled Data for High-Quality Fine-Tuning

ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Automated Data Curation for Robust Language Model Fine-Tuning

Does your data spark joy? Performance gains from domain upsampling at the end of training

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Learning from "Silly" Questions Improves Large Language Models, But Only Slightly

Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse

Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Enhancing Data Quality in Federated Fine-Tuning of Foundation Models

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Labeling supervised fine-tuning data with the scaling law

Leveraging Large Language Models for Enhanced NLP Task Performance through Knowledge Distillation and Optimized Training Strategies

FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

Fine-tuning ChatGPT for Automatic Scoring

CITING: Large Language Models Create Curriculum for Instruction Tuning

Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level

AnyTaskTune: Advanced Domain-Specific Solutions through Task-Fine-Tuning