Explaining Neural Scaling Laws

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma
DOI: https://doi.org/10.1073/pnas.2311878121
2024-04-29
Abstract: The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origins of and relationships between scaling exponents.
Machine Learning, Disordered Systems and Neural Networks
What problem does this paper attempt to address?
This paper addresses the origin of neural scaling laws. The population loss of trained deep neural networks is often observed to follow a precise power law in either the size of the training dataset or the number of network parameters. The authors propose a theory that explains where these scaling laws come from and how they are connected. The paper distinguishes two kinds of scaling behavior, variance-limited and resolution-limited, each of which can occur for dataset size and for model size, giving four scaling regimes in total. Variance-limited scaling follows from the existence of a well-behaved infinite-data or infinite-width limit: as dataset size or network width grows, the loss approaches its limiting value as a power law that is independent of the architecture and the underlying dataset. Resolution-limited scaling describes over-parameterized models as dataset size increases, or under-parameterized models as model size increases; here the behavior depends on the details of the data distribution, with power-law exponents between 0 and 1, and the paper explains it by positing that models are effectively resolving a smooth data manifold. The paper exhibits all four regimes in the controlled setting of large random feature models and pretrained models, and tests the predictions empirically on standard architectures and datasets. In the large-width limit, the resolution-limited exponents can equivalently be obtained from the spectrum of certain kernels, and the authors present evidence that the large-width and large-dataset resolution-limited exponents are related by a duality. The paper also reports several empirical relationships between datasets and scaling exponents under modifications of the task and of the architecture aspect ratio.
Finally, the authors provide a taxonomy for classifying scaling regimes, underscore that different mechanisms can drive improvements in loss, and lend insight into the microscopic origins of, and relationships between, scaling exponents. This offers theoretical and empirical guidance for machine learning in the era of large models and large training datasets.
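
As an illustration of what a power-law scaling relation with dataset size means in practice, the following minimal sketch estimates a scaling exponent by fitting a straight line in log-log space. It is not the authors' code: the function name `fit_power_law`, the synthetic data, and the values 5.0 and 0.35 are assumptions chosen only for illustration, and the sketch assumes a pure power law with no additive offset such as a limiting loss.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit losses ~ c * sizes**(-alpha) by least squares in log-log space.

    Hypothetical helper for illustration; returns (alpha, c).
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # A power law is a straight line in log-log coordinates:
    # log(loss) = log(c) - alpha * log(size)
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return -slope, np.exp(intercept)


# Synthetic losses (an assumption, not data from the paper) generated
# from an exact power law with exponent 0.35 and prefactor 5.0.
sizes = np.array([1e3, 1e4, 1e5, 1e6])
losses = 5.0 * sizes ** -0.35

alpha, c = fit_power_law(sizes, losses)
print(f"estimated exponent alpha = {alpha:.3f}, prefactor c = {c:.3f}")
```

On measured losses the same fit would be applied to (dataset size, loss) or (parameter count, loss) pairs. If the loss approaches a nonzero limiting value, that value would need to be subtracted, or fitted as an extra parameter, before taking logarithms.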