Abstract: High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as reference, which can introduce potential bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of the reference dataset in the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. Through training models with 1.3B parameters on the same data source processed by various quality filters, we find ScalingFilter can improve zero-shot performance of pre-trained models in downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models for semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.

What problem does this paper attempt to address?

The paper primarily addresses data quality assessment and filtering during the pre-training of large language models. Specifically, the researchers propose a new method called "ScalingFilter," which evaluates the quality of text data through a novel quality factor and uses it to select high-quality data for training.

### Research Background and Problem

Existing data quality filtering methods typically rely on known high-quality datasets as reference standards, which may introduce bias and limit the diversity and representativeness of the datasets. Additionally, some methods depend on indirect metrics such as the perplexity scores of pre-trained models to assess data quality, which may not accurately reflect the true quality of the data.

### Core Ideas of ScalingFilter

- **Definition of Quality Factor**: A quality factor is defined by calculating the difference in perplexity scores given by two pre-trained models of different sizes on the same text sample.
- **Utilizing the Law of Model Scaling**: The method applies scaling laws in reverse: it analyzes how the perplexity of models of different sizes differs on the same data and uses this difference as the measure of data quality.
- **Theoretical Analysis**: Theoretical analysis shows that the quality factor is positively correlated with data quality, meaning a higher quality factor indicates better data quality.
- **Experimental Validation**: Experimental results show that data filtered using ScalingFilter can improve the zero-shot performance of pre-trained models on downstream tasks while better maintaining the diversity of the dataset.

### Main Contributions

1. **Quality Factor**: A new concept of quality factor is proposed, which is directly related to the quality of training data, providing a more accurate and less biased data filtering method.
2. **ScalingFilter**: A new data quality filtering method, ScalingFilter, is proposed. This method uses the quality factor to select high-quality data without relying on reference datasets, thereby reducing potential bias and increasing the representativeness of the training corpus.
3. **Semantic Diversity Metric**: To evaluate the diversity of the filtered dataset, a new metric called semantic diversity is introduced, and its effectiveness and reliability are validated through extensive experiments.

In summary, the method proposed in this paper aims to address the shortcomings of existing data quality assessment and filtering techniques. Its effectiveness is demonstrated through theoretical analysis and empirical research.
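
The quality-factor idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The summary only says the factor is based on a perplexity "difference"; the ratio of small-model to large-model perplexity used here is an assumption. The function names, the `keep_fraction` parameter, and the input format (per-token natural-log probabilities already computed by a smaller and a larger model) are likewise hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def quality_factor(small_logprobs, large_logprobs):
    """Assumed form: small-model perplexity divided by large-model perplexity.

    A larger value means the bigger model gains more on this sample.
    """
    return perplexity(small_logprobs) / perplexity(large_logprobs)


def scaling_filter(samples, keep_fraction=0.5):
    """Keep the top `keep_fraction` of samples ranked by quality factor.

    `samples` is a list of (text, small_logprobs, large_logprobs) tuples.
    """
    scored = [(quality_factor(s, l), text) for text, s, l in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [text for _, text in scored[:n_keep]]


if __name__ == "__main__":
    # Toy log-probabilities, made up purely for illustration.
    toy = [
        ("sample A", [-2.0, -2.0], [-1.0, -1.0]),  # large model much better
        ("sample B", [-2.0, -2.0], [-2.0, -2.0]),  # no gain from scale
    ]
    print(scaling_filter(toy, keep_fraction=0.5))
```

No reference dataset appears anywhere in the ranking, which is the property the summary emphasizes. In practice the log-probabilities would come from two language models of different sizes trained on the same data.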

ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

Scaling Parameter-Constrained Language Models with Quality Data

Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic

The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining

Language models scale reliably with over-training and on downstream tasks

Inverse Scaling: When Bigger Isn't Better

Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data

The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

Improving Data Efficiency via Curating LLM-Driven Rating Systems

Efficient NLP Model Finetuning via Multistage Data Filtering

Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters

Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data

Scaling Laws for Multilingual Language Models

Scaling Concept With Text-Guided Diffusion Models

FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding

Information filtering via a scaling-based function

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method