Abstract: We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are "quantized" into discrete chunks ($\textbf{quanta}$). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
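
The step from a power law in use frequencies to a power law in loss can be made concrete with a short worked calculation. This is a sketch under stated assumptions, not a derivation quoted from the abstract: the symbols $\alpha$, $a$, $b$, and $n$ are introduced here for illustration. Assume the $k$-th most frequently used quantum is needed on a fraction $p_k \propto k^{-(\alpha+1)}$ of samples, that a sample incurs loss $b$ if its quantum has not been learned and $a < b$ if it has, and that the $n$ most frequent quanta have been learned. Then the expected loss is

$$
p_k = \frac{k^{-(\alpha+1)}}{\zeta(\alpha+1)}, \qquad
L_n = a + (b-a)\sum_{k>n} p_k
\;\approx\; a + \frac{b-a}{\alpha\,\zeta(\alpha+1)}\, n^{-\alpha},
$$

where the approximation replaces the tail sum $\sum_{k>n} k^{-(\alpha+1)}$ with the integral $\int_n^\infty x^{-(\alpha+1)}\,dx = n^{-\alpha}/\alpha$. Under the further assumption that the number of learned quanta $n$ grows in proportion to model size, the reducible loss $L_n - a$ falls off as a power law in model size with exponent $\alpha$.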

What problem does this paper attempt to address?

This paper examines how the behavior of neural networks changes as their scale increases. It addresses four points:

1. **Explaining the power-law decay in neural scaling laws**:
   - The paper proposes a theoretical framework, the "quantization model," to explain why the loss falls off as a power law as model size and dataset size increase.
   - The model is based on the "quantization hypothesis," which posits that a network's knowledge and skills decompose into discrete chunks (quanta).
2. **Emergence of new capabilities**:
   - The paper also attempts to explain why new capabilities emerge suddenly as model scale increases.
   - These are capabilities that smaller models do not exhibit.
3. **Theoretical validation**:
   - The paper shows that when quanta are learned in order of decreasing use frequency, a power-law distribution of use frequencies explains the observed power-law scaling of the loss, and it validates this prediction on toy datasets.
   - It then studies large language models, using language model gradients to automatically decompose model behavior into a diverse set of skills (quanta).
4. **Empirical analysis**:
   - The paper tentatively finds that the frequency at which these quanta are used in the training distribution roughly follows a power law whose exponent corresponds to the empirical scaling exponent of language models, as the theory predicts.

In summary, the paper proposes a theoretical framework that explains how and why neural network performance changes with scale, and it tests that framework empirically.
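
The claim in point 3 can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes Zipfian use frequencies proportional to `k ** -(alpha + 1)`, a finite pool of `n_total` quanta, and a loss of 1 on samples whose quantum is unlearned and 0 otherwise. All function and parameter names are hypothetical.

```python
import numpy as np


def quanta_loss(n_learned, alpha, n_total=1_000_000,
                loss_unlearned=1.0, loss_learned=0.0):
    """Expected loss when the n_learned most frequent quanta are learned.

    Assumption: quantum k (1-indexed, most frequent first) is used on a
    fraction p_k of samples, with p_k proportional to k ** -(alpha + 1).
    """
    k = np.arange(1, n_total + 1)
    p = k ** -(alpha + 1.0)
    p /= p.sum()  # normalize use frequencies to sum to 1
    # Probability mass of samples that need a quantum not yet learned.
    unlearned_mass = p[n_learned:].sum()
    return loss_learned + (loss_unlearned - loss_learned) * unlearned_mass


def fitted_exponent(alpha, ns=(10, 30, 100, 300, 1000)):
    """Fit loss ~ n ** -exponent on a log-log scale and return the exponent."""
    losses = [quanta_loss(n, alpha) for n in ns]
    slope, _intercept = np.polyfit(np.log(ns), np.log(losses), 1)
    return -slope


if __name__ == "__main__":
    for alpha in (0.3, 0.5, 1.0):
        print(f"alpha={alpha}: fitted loss exponent = {fitted_exponent(alpha):.3f}")
```

Under these assumptions the fitted exponent should come out close to `alpha`. The range of `ns` is kept well below `n_total` so that truncating the pool of quanta does not distort the tail.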

The Quantization Model of Neural Scaling
