Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, with 14.6% for Llama-2-7B and 20.3% for CodeLlama-7B, all within 10B tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with a >100B corpus and models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
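To make the "data refinement as a programming task" idea concrete, the following is a minimal sketch of executing a per-example program of fine-grained operations. It is not the ProX implementation: the `Op` class, the `execute` function, and the operation names `normalize`, `remove_lines`, and `drop_doc` are hypothetical names chosen for illustration. The program is hard-coded here, whereas the abstract describes it as generated by a small language model for each example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    """One refinement operation. The operation names are hypothetical."""
    name: str          # "normalize", "remove_lines", or "drop_doc"
    args: tuple = ()   # arguments for the operation


def execute(program: List[Op], doc: str) -> Optional[str]:
    """Apply operations in order; return None if the document is dropped."""
    lines = doc.split("\n")
    for op in program:
        if op.name == "drop_doc":
            # Discard the whole example.
            return None
        elif op.name == "normalize":
            # String normalization: replace a source string with a target string.
            source, target = op.args
            lines = [line.replace(source, target) for line in lines]
        elif op.name == "remove_lines":
            # Remove an inclusive, 0-indexed line range from the current lines.
            start, end = op.args
            lines = lines[:start] + lines[end + 1:]
        else:
            raise ValueError(f"unknown operation: {op.name}")
    return "\n".join(lines)


# Hard-coded stand-in for a model-generated program (assumption).
doc = "Home | Login | Share\nProX refines   data.\nCopyright 2024"
program = [
    Op("remove_lines", (0, 0)),      # strip the navigation line
    Op("remove_lines", (1, 1)),      # strip the footer (indices shift after the first removal)
    Op("normalize", ("   ", " ")),   # collapse the run of three spaces
]
refined = execute(program, doc)
print(refined)
```

Because each example gets its own program, the same executor can apply different operations to different documents, which is the flexibility that fixed corpus-wide rules lack.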

Iterative Data Programming for Expanding Text Classification Corpora

Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions

The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming

Incubating Text Classifiers Following User Instruction with Nothing but LLM

Semi-Supervised Data Programming with Subset Selection

Not Enough Data? Deep Learning to the Rescue!

Making Large Language Models Better Data Creators

Automating Weak Label Generation for Data Programming with Clinicians in the Loop

Text Data Augmentation for Deep Learning

TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming

ActiveDP: Bridging Active Learning and Data Programming

Improving Classification through Weak Supervision in Context-specific Conversational Agent Development for Teacher Education

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

DAGAM: Data Augmentation with Generation And Modification

Exploring Continual Learning for Code Generation Models

Thinking Like an Annotator: Generation of Dataset Labeling Instructions

In-Context Learning for Extreme Multi-Label Classification

CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP

EvoVis: A Visual Analytics Method to Understand the Labeling Iterations in Data Programming

From Words to Code: Harnessing Data for Program Synthesis from Natural Language