Abstract: We present a smoothly broken power law functional form, which we call a Broken Neural Scaling Law (BNSL), that accurately models and extrapolates the scaling behaviors of deep neural networks, i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, model input size, number of training steps, or upstream performance varies. It does so for various architectures and for each task in a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, "emergent" "phase transitions / changes", arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single-agent and multi-agent). Compared to other functional forms for neural scaling behavior, this functional form yields considerably more accurate extrapolations of scaling behavior on this set. Moreover, it accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing, such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
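
The abstract names a smoothly broken power law but does not write it out. As a minimal sketch, assuming the form y = a + b * x^(-c0) * prod_i (1 + (x / d_i)^(1 / f_i))^(-c_i * f_i), the snippet below evaluates such a curve and checks that its log-log slope changes from -c0 before a break to -(c0 + c_i) after it. The function names `bnsl` and `loglog_slope`, the parameter names, and the numeric values are illustrative assumptions and are not taken from the linked repository.

```python
import numpy as np


def bnsl(x, a, b, c0, breaks=()):
    """Evaluate an assumed smoothly broken power law.

    a      : asymptotic offset (value approached as the power-law term vanishes)
    b, c0  : scale and exponent of the first power-law segment
    breaks : iterable of (c_i, d_i, f_i) tuples (hypothetical layout), where
             d_i is the break location, c_i the change in log-log slope,
             and f_i the sharpness (smaller f_i gives a sharper break)
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        # Each factor is ~1 for x << d_i and ~(x / d_i)^(-c_i) for x >> d_i.
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


def loglog_slope(x1, x2, **params):
    """Slope of log(y) against log(x) between two points."""
    y1, y2 = bnsl(x1, **params), bnsl(x2, **params)
    return float(np.log(y2 / y1) / np.log(x2 / x1))


# Illustrative parameters only: one break at x = 1e3 that steepens the slope by 1.0.
params = dict(a=0.0, b=1.0, c0=0.5, breaks=[(1.0, 1e3, 0.1)])
slope_before = loglog_slope(1.0, 10.0, **params)
slope_after = loglog_slope(1e6, 1e7, **params)
```

With no breaks the function reduces to an ordinary power law with offset, `a + b * x**(-c0)`; a negative `c_i` makes a segment shallower or reverses its direction, which is how a form of this kind can represent non-monotonic curves.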

Broken Neural Scaling Laws
