The Quantization Model of Neural Scaling

Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
2024-01-14
Abstract: We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are "quantized" into discrete chunks ($\textbf{quanta}$). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
Machine Learning, Disordered Systems and Neural Networks
What problem does this paper attempt to address?
This paper examines how neural network behavior changes as models and datasets grow. Specifically, it addresses the following points:

1. **Explaining the power-law decay in neural scaling laws**:
   - The paper proposes a theoretical framework, the "Quantization Model," to explain why loss falls off as a power law as model size and dataset size increase.
   - The model rests on the "Quantization Hypothesis," which posits that a network's knowledge and skills can be decomposed into discrete chunks (quanta).

2. **Emergence of new capabilities**:
   - The paper also aims to explain why new capabilities emerge as model scale increases.
   - These capabilities appear suddenly at larger scales and are absent in smaller models.

3. **Theoretical validation**:
   - Using toy datasets, the paper shows that when quanta are learned in order of decreasing use frequency, a power-law distribution of use frequencies can explain the observed power-law scaling of the loss.
   - The paper then studies large language models, using language model gradients to automatically decompose model behavior into a diverse set of skills (quanta).

4. **Empirical analysis**:
   - The paper tentatively finds that the frequency at which these quanta are used in the training distribution roughly follows a power law whose exponent corresponds to the empirical scaling exponent of language models, as the theory predicts.

In summary, the paper proposes a theoretical framework that explains both the smooth power-law improvement and the sudden emergence of capabilities as neural networks scale, and it tests that framework empirically on toy datasets and language models.
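The central claim, that learning quanta in order of decreasing use frequency turns a power law in frequencies into a power law in loss, can be checked numerically. The following is a minimal sketch rather than the authors' code. It assumes use frequencies of the form p_k ∝ k^-(α+1), a fixed per-sample loss of 1 on unlearned quanta and 0 on learned ones, and a finite pool of quanta; the names `expected_loss` and `fitted_exponent` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def expected_loss(n_learned, alpha=0.5, n_quanta=1_000_000,
                  loss_unlearned=1.0, loss_learned=0.0):
    """Expected loss after the n_learned most frequent quanta are learned.

    Assumption (illustrative): quantum k is used with probability
    p_k proportional to k^-(alpha + 1), and each sample incurs a fixed
    loss depending only on whether its quantum has been learned.
    """
    k = np.arange(1, n_quanta + 1, dtype=np.float64)
    p = k ** -(alpha + 1.0)
    p /= p.sum()  # normalize the frequencies to a probability distribution
    per_quantum = np.where(k <= n_learned, loss_learned, loss_unlearned)
    return float(np.sum(p * per_quantum))


def fitted_exponent(ns, losses):
    """Negated slope of log(loss) against log(n), from a least-squares line."""
    slope, _ = np.polyfit(np.log(ns), np.log(losses), 1)
    return -slope


# Number of quanta learned, standing in for model capacity.
ns = np.array([10, 30, 100, 300, 1000])
losses = np.array([expected_loss(n) for n in ns])
alpha_hat = fitted_exponent(ns, losses)
print(f"fitted scaling exponent: {alpha_hat:.3f}")
```

Under these assumptions the fitted exponent should land close to the chosen `alpha`, since the loss is the tail sum of the frequencies beyond the n-th quantum, which scales roughly as n^-α. The match is only approximate because the pool of quanta is finite and the sum is discrete.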