ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

Ruihang Li, Yixuan Wei, Miaosen Zhang, Nenghai Yu, Han Hu, Houwen Peng
2024-08-16
Abstract: High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as reference, which can introduce potential bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of the reference dataset in the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. By training models with 1.3B parameters on the same data source processed by various quality filters, we find that ScalingFilter can improve the zero-shot performance of pre-trained models on downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.
Computation and Language
What problem does this paper attempt to address?
The paper addresses data quality assessment and filtering for the pre-training of large language models. The researchers propose a method called "ScalingFilter," which scores text with a new quality factor and uses that score to select high-quality data for training.

### Research Background and Problem

Existing data quality filtering methods typically rely on a known high-quality dataset as a reference standard, which may introduce bias and limit the diversity and representativeness of the filtered data. Other methods depend on indirect metrics, such as perplexity scores from a pre-trained model, which may not accurately reflect the true quality of the data.

### Core Ideas of ScalingFilter

- **Definition of Quality Factor**: The quality factor of a text sample is computed from the difference between the perplexity scores that two pre-trained models of different sizes assign to that sample.
- **Inverse Use of Scaling Laws**: The method applies model scaling laws in reverse. Instead of predicting how loss changes with model size, it measures how perplexity changes between a smaller and a larger model on the same data and uses that change as a measure of data quality.
- **Theoretical Analysis**: The analysis shows that the quality factor is positively correlated with data quality, so a higher quality factor indicates better data.
- **Experimental Validation**: Models trained on data filtered with ScalingFilter show improved zero-shot performance on downstream tasks, and the filtered data retains more of the dataset's diversity.

### Main Contributions

1. **Quality Factor**: A new quality factor is proposed that relates directly to the quality of training data and supports filtering without the bias of a reference dataset.
2. **ScalingFilter**: A new data quality filtering method, ScalingFilter, is proposed. It uses the quality factor to select high-quality data without relying on a reference dataset, thereby reducing potential bias and increasing the representativeness of the training corpus.
3. **Semantic Diversity Metric**: To evaluate the diversity of the filtered dataset, a new metric called semantic diversity is introduced, and its effectiveness and reliability are validated through extensive experiments.

In summary, the paper aims to address the shortcomings of existing data quality assessment and filtering techniques, and it supports the proposed method with both theoretical analysis and empirical results.
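
The quality factor described under the core ideas can be illustrated with a minimal Python sketch. The text above only states that the factor is derived from the perplexity difference between a smaller and a larger model, so everything specific here is an assumption made for illustration: the ratio form (small-model perplexity divided by large-model perplexity), the function names `perplexity`, `quality_factor`, and `select_top_fraction`, and the top-fraction selection rule. The per-token log-probabilities are assumed to have been computed beforehand by the two models.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sample: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def quality_factor(small_logprobs: Sequence[float],
                   large_logprobs: Sequence[float]) -> float:
    """Hypothetical quality factor for one sample.

    ASSUMPTION: expressed as the ratio of the smaller model's perplexity to the
    larger model's perplexity. The summary only says the factor comes from the
    perplexity difference between the two models, so the exact form may differ.
    """
    return perplexity(small_logprobs) / perplexity(large_logprobs)


def select_top_fraction(factors: Sequence[float], keep_fraction: float) -> List[int]:
    """Return the indices of the samples with the highest quality factors.

    ASSUMPTION: filtering keeps a fixed top fraction of samples.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(len(factors) * keep_fraction))
    ranked = sorted(range(len(factors)), key=lambda i: factors[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy per-token log-probabilities for three samples (made-up numbers).
small_model = [[-2.0, -2.0], [-1.0, -1.0], [-3.0, -3.0]]
large_model = [[-1.0, -1.0], [-1.0, -1.0], [-2.5, -2.5]]

factors = [quality_factor(s, l) for s, l in zip(small_model, large_model)]
kept = select_top_fraction(factors, keep_fraction=0.5)
```

In this toy run the first and third samples are kept because the larger model lowers their perplexity the most relative to the smaller model, while the second sample, on which both models score identically, is dropped. No reference dataset is involved at any step, which is the property the summary emphasizes.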