Abstract: We consider the three-parameter solvable neural scaling model introduced by Maloney, Roberts, and Sully. The model has three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.

What problem does this paper attempt to address?

This paper studies compute-optimal neural scaling laws, focusing on how to choose the model size that minimizes loss given infinite data and a fixed compute budget. The study is based on the three-parameter solvable neural scaling model proposed by Maloney, Roberts, and Sully, whose parameters are data complexity, target complexity, and model parameter count. The authors train the model with one-pass stochastic gradient descent (SGD) and derive a representation of the loss curves that becomes more accurate as the model parameter count grows. From it they identify four phases (plus three subphases) of compute-optimal behavior in the data-complexity/target-complexity plane.

The main contributions of this paper include:
1. Analyzing the three-parameter model, called the power-law random features (PLRF) model, to study training dynamics and predict scaling laws.
2. Deriving expressions for the compute-optimal parameter count that hold when the parameter count is large, and estimating the order of parameter count needed to reach these scaling laws.
3. Finding a universal scaling behavior: in certain regions of the phase plane, the optimal parameter count is proportional to the square root of the number of floating-point operations.
4. Characterizing which compute-optimal curves are limited by model capacity, by the quality of the feature embedding, or by SGD noise.

In the paper, the authors work with the PLRF model and capture the dynamics of SGD training through deterministic equivalents. They then identify four phases as the data complexity and target complexity (α and β) vary, each with its own compute-optimal curve and loss-curve behavior. Furthermore, they discuss how the optimal model size depends on the compute budget in each phase, and observe that in some phases the optimal model size is proportional to the square root of the compute budget.
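To make the training setup concrete, here is a minimal NumPy sketch of one-pass SGD on a power-law random features problem. The summary above does not spell out the model, so the specific choices below are assumptions for illustration: features with standard deviation j^(-α), target coefficients j^(-β), a Gaussian random embedding `W`, and the names `plrf_one_pass_sgd`, `v`, `d`, and `lr` are all hypothetical. It is a toy under those assumptions, not the authors' implementation.

```python
import numpy as np

def plrf_one_pass_sgd(d=50, v=500, alpha=1.0, beta=0.7,
                      steps=2000, lr=0.1, seed=0):
    """One-pass SGD on a toy power-law random features problem.

    All scalings here are illustrative assumptions:
      - v: ambient feature dimension, d: model parameter count
      - alpha: data complexity, beta: target complexity
    Returns the population loss after every step (length steps + 1).
    """
    rng = np.random.default_rng(seed)
    j = np.arange(1, v + 1)
    scale = j ** (-alpha)                      # assumed std of feature j
    b = j ** (-beta)                           # assumed target coefficients
    W = rng.normal(size=(v, d)) / np.sqrt(d)   # assumed random embedding
    theta = np.zeros(d)

    def population_loss(theta):
        # For Gaussian x with diagonal covariance diag(scale^2):
        # E[(x . (W theta - b))^2] = sum_j scale_j^2 * (W theta - b)_j^2
        r = W @ theta - b
        return float(np.sum((scale * r) ** 2))

    losses = [population_loss(theta)]
    for _ in range(steps):
        x = scale * rng.normal(size=v)   # fresh sample each step (one pass)
        feats = W.T @ x                  # d-dimensional embedded features
        resid = feats @ theta - x @ b    # prediction error on this sample
        theta -= lr * resid * feats      # gradient step on 0.5 * resid^2
        losses.append(population_loss(theta))
    return np.array(losses)

losses = plrf_one_pass_sgd()
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

Because every step draws a new sample, the number of iterations equals the number of samples seen, which is what makes the infinite-data, compute-limited regime natural to study in this setting.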
In conclusion, this paper addresses how to choose the model size that minimizes loss under a fixed compute budget with infinite data. By analyzing the interaction between the model and the training algorithm, it identifies the key factors that govern compute efficiency and provides new predictions together with their theoretical foundations.
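The idea of a compute-optimal model size can be illustrated with a short sketch: for each compute budget, scan over parameter counts, pick the one with the lowest loss, and fit the exponent of the resulting curve on a log-log scale. The loss function `toy_loss` below is a hypothetical placeholder (a capacity term plus an optimization term), not the loss formula derived in the paper, and the accounting "flops = parameter count × iterations" is an assumption made for this example.

```python
import numpy as np

def toy_loss(d, r, a=1.0, b=1.0):
    # Hypothetical placeholder: loss falls with parameter count d
    # and with iteration count r. Not the paper's loss formula.
    return d ** (-a) + r ** (-b)

def compute_optimal_d(flops, d_grid, loss_fn=toy_loss):
    # Assumption: one SGD step costs about d flops, so r = flops / d.
    r = flops / d_grid
    return d_grid[np.argmin(loss_fn(d_grid, r))]

def fit_exponent(budgets, d_grid, loss_fn=toy_loss):
    # Slope of log(d*) against log(flops) estimates the scaling exponent.
    d_star = np.array([compute_optimal_d(f, d_grid, loss_fn) for f in budgets])
    slope, _intercept = np.polyfit(np.log(budgets), np.log(d_star), 1)
    return slope

budgets = np.logspace(4, 10, 13)   # compute budgets to scan
d_grid = np.logspace(0, 6, 601)    # candidate parameter counts
exponent = fit_exponent(budgets, d_grid)
print(f"fitted exponent: {exponent:.3f}")
```

With the symmetric placeholder exponents `a = b`, the fitted slope should come out near 1/2, which mirrors the square-root behavior described above; other choices of `a` and `b` give other exponents, so the sketch says nothing about which phase a real model is in.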

4+3 Phases of Compute-Optimal Neural Scaling Laws
