Abstract: We present a smoothly broken power law functional form, which we refer to as a Broken Neural Scaling Law (BNSL), that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e., how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, model input size, number of training steps, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, "emergent" "phase transitions / changes", arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). Compared to other functional forms for neural scaling behavior, this functional form yields considerably more accurate extrapolations of scaling behavior on this set. Moreover, it accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing, such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
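
The abstract does not spell out the functional form itself. As a rough illustration of what a smoothly broken power law looks like, the sketch below assumes the parameterization y = a + b x^(-c0) * prod_i (1 + (x / d_i)^(1 / f_i))^(-c_i f_i), where each break i has a location d_i, a slope change c_i, and a sharpness f_i. This formula, the function name `bnsl`, and all parameter names are assumptions made for illustration and are not quoted from this text or taken from the linked repository.

```python
import math


def bnsl(x, a, b, c0, breaks=()):
    """Evaluate an assumed smoothly broken power law at a scalar x > 0.

    a      : limiting offset added to the power-law term
    b, c0  : scale and log-log slope of the first segment (slope is -c0)
    breaks : iterable of (c_i, d_i, f_i) tuples, one per break, where
             d_i is the break location, c_i the change in slope after the
             break, and f_i the sharpness (smaller f_i means a sharper break)
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


def loglog_slope(fn, x, ratio=1.01):
    """Finite-difference estimate of d(log y) / d(log x) at x."""
    return (math.log(fn(x * ratio)) - math.log(fn(x))) / math.log(ratio)


def example(x):
    # Hypothetical parameters: one break at x = 1e3 that steepens the
    # log-log slope from -0.5 to -1.5.
    return bnsl(x, a=0.0, b=1.0, c0=0.5, breaks=[(1.0, 1e3, 0.1)])


if __name__ == "__main__":
    for x in (1.0, 1e3, 1e6):
        print(x, example(x), loglog_slope(example, x))
```

Under these assumed parameters, the curve follows one power law well before the break and a steeper one well after it, with a smooth transition around x = d_1. With no breaks the function reduces to an ordinary power law plus an offset.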
