Abstract: We present a smoothly broken power law functional form, which we call a Broken Neural Scaling Law (BNSL), that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e., how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, model input size, number of training steps, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, "emergent" "phase transitions / changes", arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). Compared to other functional forms for neural scaling behavior, this functional form yields considerably more accurate extrapolations of scaling behavior on this set. Moreover, it accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing, such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
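
The abstract names the functional form but does not write it out. As a minimal sketch, assuming the form given in the body of the paper, y = a + b·x^(-c0) · ∏ᵢ (1 + (x/dᵢ)^(1/fᵢ))^(-cᵢ·fᵢ), a smoothly broken power law can be evaluated as below. The function name `bnsl`, the argument layout, and the sample parameter values are hypothetical and are not taken from the linked repository.

```python
import numpy as np


def bnsl(x, a, b, c0, breaks=()):
    """Evaluate a smoothly broken power law (illustrative sketch).

    x      : quantity being scaled (e.g. compute, parameters, dataset size)
    a      : offset approached as the power-law terms decay
    b, c0  : scale and log-log slope of the first segment
    breaks : iterable of (c_i, d_i, f_i) tuples, one per break, where
             d_i is the location of the break along x,
             c_i is the change in log-log slope across the break,
             f_i controls how sharp (small f_i) or smooth the break is.
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        # Each factor is ~1 for x << d_i and ~(x / d_i)**(-c_i) for x >> d_i.
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


# Hypothetical parameters: one break at x = 1e3 that steepens the slope.
xs = np.logspace(0, 6, 7)
ys = bnsl(xs, a=0.05, b=2.0, c0=0.1, breaks=[(0.4, 1e3, 0.5)])
```

With an empty `breaks` the function reduces to an ordinary power law plus an offset, which is the sense in which each break adds one smooth transition between power-law segments. A negative `c_i` makes the slope shallower or reverses its sign after the break, which is how a form of this kind can express non-monotonic behavior.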

How Feature Learning Can Improve Neural Scaling Laws

A Dynamical Model of Neural Scaling Laws

A Solvable Model of Neural Scaling Laws

Unified Neural Network Scaling Laws and Scale-time Equivalence

Explaining Neural Scaling Laws

Scaling Graph Neural Networks for Large-Scale Power Systems Analysis: Empirical Laws for Emergent Abilities

A Neural Scaling Law from the Dimension of the Data Manifold

Information-Theoretic Foundations for Neural Scaling Laws

4+3 Phases of Compute-Optimal Neural Scaling Laws

Broken Neural Scaling Laws

Scaling Laws in Linear Regression: Compute, Parameters, and Data

Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Revisiting Neural Scaling Laws in Language and Vision

A Spectral Condition for Feature Learning

How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets

Neural Scaling Laws Rooted in the Data Distribution

Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless Limit

Feature Learning in Infinite-Width Neural Networks

Scaling Laws for Neural Language Models

A Resource Model For Neural Scaling Law

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments