Scaling Parameter-Constrained Language Models with Quality Data

Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, Vikas Chandra
2024-10-04
Abstract: Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling law by offering a microscopic view of data quality within the original formulation -- effective training tokens -- which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over $200$ models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores. We demonstrated the estimated constants yield +0.83 Pearson correlation with true accuracies, and analyzed it in scenarios involving widely-used data techniques such as data sampling and synthesis which aim to improve data quality.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a gap in language-model scaling laws: the effect of data quality on model generalization has been overlooked. Traditional scaling laws relate training loss to the amount of data and the number of model parameters and ignore data quality. By introducing the concept of "effective training tokens", the paper emphasizes the role of data quality in the performance of parameter-constrained language models. Specifically, the authors propose a measure that combines text diversity and syntheticity to estimate the number of effective training tokens, and they validate it through a large set of pretraining experiments. This helps explain how data quality affects training and performance, especially when the parameter budget is limited.

### Main Contributions

1. **Extend Traditional Scaling Laws**: Extend the traditional scaling laws, which consider only data volume and model parameters, with a data quality term: effective training tokens. This brings data quality into the scaling equation and fills an omission in previous formulations.
2. **Study Data Refinement Techniques**: Explore how data selection (such as deduplication) and data synthesis relate to the data quality indicators (diversity and syntheticity). The results show that data quality, rather than data volume alone, can significantly improve model performance.

### Background

- **Chinchilla Scaling Laws**: Proposed by Hoffmann et al. to estimate training loss from the number of training tokens and the number of model parameters. These laws are mainly used to allocate compute for large-scale pretraining.
- **Data Refinement Techniques**: Divided into non-transformational and transformational types. Non-transformational techniques include data deduplication and selection, while transformational techniques generate new text data. Both directly affect the distribution and quality of training tokens, and therefore training efficiency and effectiveness.

### Quantification of Data Quality

- **Diversity**: Measured with the compression ratio (CR) of the text. A high compression ratio means the text contains more repetitive content, that is, lower diversity.
- **Syntheticity**: Measured with the perplexity of a data point under a teacher model. A low perplexity means the data point closely matches the teacher model's predictions, that is, higher syntheticity.

### Specific Methods

- **Modify Chinchilla Scaling Laws**: Incorporate data quality into the scaling-law formula and predict the average accuracy on zero-shot reasoning tasks from the quality-adjusted number of training tokens \( D_q \).
- **Experimental Setup**: Pretrain multiple models of different sizes on different datasets and evaluate the impact of data refinement techniques on model performance.

### Experimental Results

- **Correlation Analysis**: Accuracies predicted with the estimated constants reach a Pearson correlation of +0.83 with the true accuracies, which supports the claim that data quality has a significant impact on model performance.
- **Relationship between Data Quality and Model Capacity**: For smaller models (25M to 500M parameters), the influence of data quality is more pronounced, while for larger models (more than 1.5B parameters), data volume matters more.

### Conclusion

This paper extends the traditional scaling laws by introducing the concept of "effective training tokens" and emphasizes the importance of data quality in model training. The results provide a theoretical basis for developing more efficient and compact language models, especially models intended for on-device applications.

The research also has limitations: estimating the constants accurately requires a large number of sample points, and the estimates are less reliable for rare or complex data characteristics. Future work can address these points to improve the robustness of the approach.
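
The two quality indicators and the quality-adjusted token count \( D_q \) described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `zlib` stands in for whichever compressor the paper uses, `perplexity` takes per-token log-probabilities that a teacher model is assumed to have produced already, and `effective_tokens` uses a hypothetical multiplicative adjustment, because the exact functional form and fitted constants are not given in this summary. `chinchilla_loss` is the standard Hoffmann et al. loss form, whereas the paper fits task accuracy.

```python
import math
import zlib
from typing import Sequence


def compression_ratio(text: str) -> float:
    """Raw byte size divided by compressed size.

    A higher ratio means more redundancy, that is, lower diversity.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(raw) / len(zlib.compress(raw, 9))


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token.

    The log-probabilities are assumed to come from a teacher model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def effective_tokens(n_tokens: float, quality: float) -> float:
    """Hypothetical adjustment D_q = quality * D (an assumption).

    `quality` would be derived from diversity and syntheticity.
    """
    return n_tokens * quality


def chinchilla_loss(n_params: float, n_tokens: float, E: float,
                    A: float, B: float, alpha: float, beta: float) -> float:
    """Chinchilla form: L = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted and observed scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Passing `effective_tokens(D, quality)` in place of the raw token count `D` to `chinchilla_loss` shows the intended effect: with positive `B` and `beta`, a higher quality factor lowers the predicted loss at a fixed model size.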