4+3 Phases of Compute-Optimal Neural Scaling Laws

Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington
2024-05-24
Abstract: We consider the three parameter solvable neural scaling model introduced by Maloney, Roberts, and Sully. The model has three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
Machine Learning, Optimization and Control, Probability, Statistics Theory
What problem does this paper attempt to address?
This paper studies compute-optimal scaling laws for neural networks: how to choose the model size that minimizes loss when data is unlimited and the compute budget is fixed. The study is based on the three-parameter solvable neural scaling model introduced by Maloney, Roberts, and Sully, whose parameters are data complexity, target complexity, and model-parameter-count. The authors train the model with one-pass stochastic gradient descent (SGD) on a mean-squared loss, derive a representation of the loss curves that becomes more accurate as the model-parameter-count grows, and use it to identify four phases (plus three subphases) of compute-optimal behavior in the data-complexity/target-complexity plane.

The main contributions of this paper include:
1. Adopting a three-parameter model, called power-law random features (PLRF), to analyze training dynamics and predict scaling laws.
2. Deriving expressions for the compute-optimal parameter count that hold when the parameter count is large, and estimating the order of parameter count required for these scaling laws to take effect.
3. Identifying a universal scaling behavior: in certain phases, the compute-optimal parameter count is proportional to the square root of the floating-point operation budget.
4. Describing the compute-optimal curves according to whether they are limited by model capacity, by the quality of the feature embedding, or by SGD noise.

The authors carry out the analysis on the PLRF model and capture the dynamics of SGD training through a deterministic equivalent. They then identify four phases in terms of the data complexity α and the target complexity β, each with its own compute-optimal curve and loss-curve behavior. Finally, they relate the optimal model size to the compute budget in each phase.
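To make the training setup concrete, here is a minimal sketch of one-pass SGD on a power-law random-features style model. The specifics are assumptions made for illustration and are not stated in the text above: inputs with coordinates scaled as j^(-α), a target vector with entries j^(-β), a Gaussian embedding matrix `W` with variance 1/d, an ambient dimension `v`, and the function names `population_loss` and `one_pass_sgd`.

```python
import numpy as np

def population_loss(theta, W, scales, b):
    """Expected squared error over x, with x_j = scales_j * z_j and z_j ~ N(0, 1)."""
    residual = W @ theta - b
    return float(np.sum((scales * residual) ** 2))

def one_pass_sgd(alpha, beta, d, v, steps, lr, seed=0):
    """Run one-pass SGD (a fresh sample every step) and record the population loss.

    Assumed setup, for illustration only:
      x_j = j^(-alpha) * z_j   (data complexity alpha)
      b_j = j^(-beta)          (target complexity beta)
      W   = v-by-d Gaussian matrix with entries of variance 1/d
    """
    rng = np.random.default_rng(seed)
    j = np.arange(1, v + 1)
    scales = j ** (-float(alpha))
    b = j ** (-float(beta))
    W = rng.normal(size=(v, d)) / np.sqrt(d)

    theta = np.zeros(d)
    losses = [population_loss(theta, W, scales, b)]
    for _ in range(steps):
        x = scales * rng.normal(size=v)   # new sample, never reused
        features = W.T @ x                # d-dimensional embedded input
        err = features @ theta - x @ b    # prediction minus target
        theta -= lr * err * features      # gradient step on 0.5 * err**2
        losses.append(population_loss(theta, W, scales, b))
    return np.array(losses)

losses = one_pass_sgd(alpha=1.0, beta=1.0, d=32, v=256, steps=500, lr=0.1)
print(losses[0], losses[-1])
```

Each step touches a d-dimensional parameter vector once, which is why a compute budget can be thought of as roughly the product of the parameter count and the number of iterations.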
In conclusion, this paper addresses how to choose the model size that minimizes loss under a fixed compute budget and infinite data. By analyzing the interaction between the model and the training algorithm, it identifies the key factors that determine compute efficiency and provides new predictions together with their theoretical foundation.
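The idea of picking a compute-optimal model size can be illustrated with a toy calculation. The loss surface below, `r**(-a) + d**(-b)`, is a hypothetical stand-in and not the paper's loss formula; the exponents `a` and `b`, the budget rule `flops = d * r`, and the function names are likewise assumptions. For each budget the sketch scans candidate sizes `d`, spends the remaining budget on iterations `r`, and then fits how the best `d` grows with the budget.

```python
import numpy as np

def toy_loss(r, d, a, b):
    """Hypothetical loss: an iteration-limited term plus a capacity-limited term."""
    return r ** (-a) + d ** (-b)

def compute_optimal_d(flops, a, b, d_grid):
    """Return the grid size d that minimizes the toy loss when flops = d * r."""
    r = flops / d_grid                       # iterations affordable at each size
    return d_grid[np.argmin(toy_loss(r, d_grid, a, b))]

def fitted_exponent(a, b):
    """Fit the slope of log(d*) against log(flops) over a range of budgets."""
    d_grid = np.logspace(0, 8, 4001)
    budgets = np.logspace(6, 12, 13)
    d_star = np.array([compute_optimal_d(f, a, b, d_grid) for f in budgets])
    slope, _ = np.polyfit(np.log(budgets), np.log(d_star), 1)
    return slope

print(fitted_exponent(1.0, 1.0))
print(fitted_exponent(1.0, 2.0))
```

For this toy surface, setting the derivative in `d` to zero gives an optimal size proportional to `flops**(a / (a + b))`, so equal exponents yield a square-root rule. Which exponents actually apply in each phase is what the paper derives; the toy surface only shows the mechanics of the trade-off.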