Abstract: The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origins of and relationships between scaling exponents.
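
As a reading aid, the power-law relations the abstract refers to can be written in a generic form. This is a sketch under assumptions, not the paper's own notation: the symbols L_\infty (limiting loss), C_D and C_P (prefactors), and \alpha_D and \alpha_P (scaling exponents for dataset size D and parameter count P) are introduced here for illustration only.

```latex
% Generic power-law ansatz (illustrative notation, not taken from the paper):
% loss as a function of dataset size D at fixed model size
L(D) \approx L_\infty + C_D \, D^{-\alpha_D},
% loss as a function of parameter count P at fixed dataset size
\qquad
L(P) \approx L_\infty + C_P \, P^{-\alpha_P}.
```

In this form, each of the four regimes corresponds to one choice of variable (D or P) together with one mechanism (variance-limited or resolution-limited) that sets the exponent.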

What problem does this paper attempt to address?

This paper addresses the origin of neural scaling laws. Researchers have observed that the loss of well-trained deep neural networks often follows a precise power-law relationship with the size of the training dataset or the number of network parameters. The authors propose a theory that explains these scaling laws and connects them. The paper distinguishes two scaling behaviors, variance-limited and resolution-limited, each of which can occur for dataset size and for model size, giving four scaling regimes in total. In the variance-limited regime, the scaling follows from the existence of a well-behaved infinite-data or infinite-width limit: as the dataset size or network width grows, the loss approaches its limiting value as a power law whose exponent is independent of the architecture and the underlying dataset. In the resolution-limited regime, which covers over-parameterized models trained on increasing amounts of data and under-parameterized models of increasing size, the improvement depends on the details of the data distribution, and the power-law exponents lie between 0 and 1. The authors explain this regime by positing that models are effectively resolving a smooth data manifold. The paper exhibits all four regimes in the controlled setting of large random feature models and pretrained models, and tests the predictions empirically on standard architectures and datasets. It also presents evidence that the resolution-limited exponents for large width and for large dataset size are related by a duality, and shows that in the large-width limit these exponents can be obtained from the spectrum of certain kernels. In addition, the paper studies how the scaling exponents change under modifications of the task and of the architecture aspect ratio.
Finally, the authors provide a taxonomy of scaling regimes, emphasize that different mechanisms can drive improvements in loss, and offer insight into the microscopic origins of the scaling exponents and the relationships between them, providing theoretical and empirical guidance for machine learning in the era of large-scale models and training data.
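
To make the notion of a scaling exponent concrete, the following minimal sketch estimates an exponent from (dataset size, loss) pairs by ordinary least squares in log-log space. It is not the authors' procedure. The function name `fit_power_law`, the `loss_floor` parameter, and the synthetic data (generated with an assumed exponent of 0.4 and prefactor of 5.0) are all hypothetical and chosen only for illustration.

```python
import math
import random


def fit_power_law(sizes, losses, loss_floor=0.0):
    """Fit losses ~ loss_floor + c * sizes**(-alpha) via a log-log linear fit.

    Returns (alpha, c). loss_floor is an assumed, user-supplied limiting loss;
    every loss must be strictly greater than it.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(loss - loss_floor) for loss in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares slope and intercept for y = slope * x + intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic illustration only: these are not measurements from the paper.
rng = random.Random(0)
dataset_sizes = [1e3, 3e3, 1e4, 3e4, 1e5, 3e5]
true_alpha, true_c = 0.4, 5.0
losses = [
    true_c * d ** (-true_alpha) * math.exp(rng.gauss(0.0, 0.01))
    for d in dataset_sizes
]

alpha_hat, c_hat = fit_power_law(dataset_sizes, losses)
print(f"estimated exponent: {alpha_hat:.3f}, estimated prefactor: {c_hat:.3f}")
```

The same function applies unchanged to (parameter count, loss) pairs. When the limiting loss is unknown, a nonlinear fit that also estimates the floor would be needed, since the log-log fit assumes the floor has already been subtracted.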

Explaining Neural Scaling Laws
