QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

2024-07-18

Abstract: Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs are able to discern these qualities, especially when making pairwise judgments of texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity. When we sample using quality ratings as logits over documents, our models obtain lower perplexity and stronger in-context learning performance than baselines. Our best model is based on educational value and performs similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses how to select high-quality pre-training data in order to train more capable language models. Existing methods usually rely on simple heuristics, which may not capture human intuitions about data quality. The paper therefore proposes QuRating, a method that judges quality by comparing pairs of texts and trains a model (the QuRater model) to learn scalar quality ratings from these comparisons. It investigates four quality criteria: writing style, required expertise, facts and trivia, and educational value. It finds that large language models (LLMs) can discern these qualities, especially when making pairwise judgments of texts.

The main steps of the QuRating method are:

1. **Compare text pairs**: Judge pairs of texts against the chosen quality criterion.
2. **Train the QuRater model**: Use the pairwise judgments to train a model that predicts a scalar quality rating for each document.
3. **Select pre-training data**: Use the quality ratings to select a subset of the pre-training corpus.
4. **Evaluate the value of quality criteria**: Train a language model on each selected subset and evaluate it to determine which abstract qualities are valuable.

The aim is to raise data quality while preserving data diversity. In the experiments, models trained on data sampled with the quality ratings as logits over documents obtain lower perplexity and stronger in-context learning performance than the baselines, and the authors find that balancing quality and diversity is important for this result.
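The two central ideas described above, learning scalar ratings from pairwise judgments and sampling documents with the ratings as logits, can be illustrated with a small sketch. It rests on several assumptions that are not stated in the text: it fits one free scalar per document with a Bradley-Terry-style logistic loss instead of training a QuRater model, it adds a hypothetical `temperature` parameter to trade quality against diversity, and it samples without replacement using the Gumbel top-k trick. The function names `fit_ratings` and `sample_by_quality` are illustrative only.

```python
import math
import random


def fit_ratings(num_docs, judgments, lr=0.1, epochs=200):
    """Fit one scalar rating per document from pairwise judgments.

    judgments: list of (a, b, p_b_better) tuples, where p_b_better in [0, 1]
    is the judged probability that document b beats document a.
    Assumption: a Bradley-Terry-style logistic loss on the rating gap.
    """
    ratings = [0.0] * num_docs
    for _ in range(epochs):
        for a, b, p in judgments:
            # Predicted probability that b is preferred, from the rating gap.
            pred = 1.0 / (1.0 + math.exp(-(ratings[b] - ratings[a])))
            # Gradient of the binary cross-entropy w.r.t. the rating gap.
            grad = pred - p
            ratings[b] -= lr * grad
            ratings[a] += lr * grad
    return ratings


def sample_by_quality(ratings, k, temperature=1.0, seed=0):
    """Sample k distinct documents using ratings / temperature as logits.

    Adding independent Gumbel noise to each logit and keeping the k largest
    values is equivalent to sampling without replacement from the softmax.
    A small temperature approaches top-k selection (quality); a large
    temperature approaches uniform sampling (diversity).
    """
    rng = random.Random(seed)
    noisy = []
    for index, rating in enumerate(ratings):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((rating / temperature + gumbel, index))
    noisy.sort(reverse=True)
    return [index for _, index in noisy[:k]]


if __name__ == "__main__":
    # Toy judgments: document 2 beats 1, and both beat document 0.
    toy = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
    toy_ratings = fit_ratings(3, toy)
    print(toy_ratings)
    print(sample_by_quality(toy_ratings, k=2, temperature=1.0))
```

In this sketch, lowering `temperature` concentrates the selection on the highest-rated documents, while raising it moves the selection toward uniform sampling, which is one simple way to express the quality-diversity balance the summary mentions.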
