A Neural Scaling Law from Lottery Ticket Ensembling

Ziming Liu, Max Tegmark
2024-02-02
Abstract: Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
Machine Learning, Artificial Intelligence, Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
This paper addresses neural scaling laws (NSL), the phenomenon in which neural network performance improves with scale. Specifically, the authors identify a new scaling law in which the MSE loss decays as \( N^{-1} \), where \( N \) is the number of model parameters. This differs from the prediction of existing approximation theory (Sharma & Kaplan), which gives \( N^{-\alpha} \) with \( \alpha = 4/d \) for intrinsic input dimension \( d \): for the simple 1D problem \( y = x^2 \), that theory predicts \( \alpha = 4 \), whereas the observed exponent is \( \alpha = 1 \). The authors attribute the new scaling law to lottery ticket ensembling: a wider network contains, on average, more "lottery tickets," which are ensembled to reduce the variance of the output. They also discuss the potential implications of this mechanism for large language models and statistical physics-type theories of learning. In summary, the paper identifies a new neural scaling mechanism and explains the principle behind it.
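The variance-reduction argument can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions, not the authors' experiment: each "lottery ticket" is modeled as the target \( y = x^2 \) plus independent zero-mean Gaussian noise of variance \( \sigma^2 \), and the number of tickets \( n \) stands in for network width. Under these assumptions the average of \( n \) tickets has MSE \( \sigma^2 / n \), so a log-log fit of MSE against \( n \) should give an exponent close to 1. The names `ensemble_mse` and `fit_exponent` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def ensemble_mse(n_tickets, n_trials=200, sigma=1.0, seed=0):
    """MSE of an average of n_tickets noisy predictors of y = x^2.

    Assumption: every ticket equals the target plus independent
    zero-mean Gaussian noise. This is a toy model, not a trained network.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, 20)
    target = x ** 2
    # Shape: (trials, tickets, input points); noise is i.i.d. across all axes.
    noise = rng.normal(0.0, sigma, size=(n_trials, n_tickets, x.size))
    # Ensembling = averaging the tickets' outputs.
    ensemble = target + noise.mean(axis=1)
    return float(np.mean((ensemble - target) ** 2))


def fit_exponent(ns, mses):
    """Fit MSE ~ n^(-alpha) by least squares in log-log space; return alpha."""
    slope, _intercept = np.polyfit(np.log(ns), np.log(mses), 1)
    return float(-slope)


ns = np.array([1, 4, 16, 64, 256])
mses = np.array([ensemble_mse(int(n)) for n in ns])
alpha = fit_exponent(ns, mses)
print(f"fitted alpha = {alpha:.2f}")
```

The fitted exponent reflects only the averaging of independent errors. It says nothing about how many tickets a real network of a given width contains, which is the empirical question the paper studies.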