Position: Understanding LLMs Requires More Than Statistical Generalization

Patrik Reizinger, Szilvia Ujváry, Anna Mészáros, Anna Kerekes, Wieland Brendel, Ferenc Huszár
2024-06-17
Abstract: The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart -- thus, equivalent test loss -- can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
Machine Learning
What problem does this paper attempt to address?
This paper examines the limits of current deep learning theory in explaining the behavior of large language models (LLMs). Its central claim is that statistical generalization cannot fully explain several key properties of LLMs. The authors argue that these models need to be studied from a new perspective, in particular in the "saturation regime", in order to reveal important properties beyond statistical generalization.

#### Main problems

1. **Limitations of statistical generalization**:
   - Although LLMs can achieve low loss on both the training set and the test set (i.e., they generalize well in the statistical sense), this does not guarantee their performance on downstream tasks.
   - Statistical generalization cannot explain properties of LLMs such as zero-shot rule extrapolation, in-context learning (ICL), and fine-tunability.
2. **Non-uniqueness and non-identifiability**:
   - In the "saturation regime", multiple models can reach the same minimum test loss while behaving very differently. This is because autoregressive (AR) probabilistic models are inherently non-identifiable: two models that are zero or near-zero KL divergence apart can still behave completely differently on low-probability sequences.
   - This non-identifiability affects zero-shot rule extrapolation, in-context learning, and fine-tunability.
3. **Importance of inductive biases**:
   - The authors emphasize that, in the "saturation regime", it is important to understand the inductive biases that lead to good performance. For example, different parameterizations can reach the same test loss yet differ in downstream performance after fine-tuning.
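The non-identifiability claim above can be illustrated with a toy example. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical two-token autoregressive model with prefixes `"a"` and `"b"`, and an assumed rare-prefix probability `DELTA`. The two models `p` and `q` agree everywhere except after the rare prefix, so their KL divergence is tiny even though their predictions after that prefix are opposite.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions stored as dicts."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)

def joint(first, cond):
    """Joint of a length-2 autoregressive model: p(x1, x2) = p(x1) * p(x2 | x1)."""
    return {(x1, x2): first[x1] * cond[x1][x2] for x1 in first for x2 in cond[x1]}

DELTA = 1e-6  # assumed probability of the rare prefix "b" (illustrative value)

# Both models share the same distribution over the first token.
first = {"a": 1 - DELTA, "b": DELTA}

# The models agree after the common prefix "a" ...
cond_p = {"a": {"0": 0.5, "1": 0.5}, "b": {"0": 0.9, "1": 0.1}}
# ... but q flips its prediction after the rare prefix "b".
cond_q = {"a": {"0": 0.5, "1": 0.5}, "b": {"0": 0.1, "1": 0.9}}

p, q = joint(first, cond_p), joint(first, cond_q)

kl_pq = kl(p, q)
pred_p = max(cond_p["b"], key=cond_p["b"].get)  # most likely next token under p
pred_q = max(cond_q["b"], key=cond_q["b"].get)  # most likely next token under q

print(f"KL(p || q) = {kl_pq:.2e}")
print(f"prediction after rare prefix 'b': p -> {pred_p}, q -> {pred_q}")
```

Shrinking `DELTA` drives the divergence toward zero without changing the disagreement after the rare prefix, which is the sense in which test loss alone cannot tell the two models apart.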
#### Specific case analysis

- **Zero-shot rule extrapolation**:
  - Through experiments, the authors show that LLMs can extrapolate reasonably even to patterns that never appeared in the training data, and they attribute this ability to inductive biases.
- **In-context learning (ICL)**:
  - The authors prove that, even with infinite data, the in-context learning ability of some LLMs can be ε-non-identifiable: there exists another model that is very close in KL divergence but lacks this ability.
- **Parameter non-identifiability in fine-tuning**:
  - Models with different parameterizations can behave very differently during fine-tuning, which shows the need to better understand which parameterizations help fine-tuning and transfer learning.

#### Conclusion

The paper argues that a fuller understanding of LLMs requires going beyond the traditional statistical generalization framework and studying model behavior in the "saturation regime", especially the properties driven by inductive biases. This would help reveal the potential and limitations of LLMs in practical applications and suggests new directions for future research.

### Formula summary

1. **KL divergence**:
   \[ D_{\text{KL}}(P \parallel Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)} \]
2. **Statistical generalization condition used in the zero-shot rule extrapolation case**:
   \[ \text{If}\; D_{\text{KL}}\big(p(x_{1:k}) \parallel q(x_{1:k})\big) = 0 \;\text{then}\; q \;\text{perfectly generalizes in a statistical sense} \]
3. **Definition of in-context learning**:
   \[ \operatorname*{arg\,max}_y\, p(y \mid S_n, x_{\text{test}}) \to \operatorname*{arg\,max}_y\, p_{\text{prompt}}(y \mid x_{\text{test}}) \;\text{as}\; n \to \infty \]

Through these formulas and case analyses, the paper emphasizes the importance of studying LLMs in the "saturation regime" and the key role that inductive biases play there.
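
The link between small KL divergence and unconstrained behavior on rare sequences can be made explicit with the chain rule for KL divergence, a standard identity. The worked equation below is a sketch added for illustration, not a formula quoted from the paper; it reuses the symbols p, q, and x_{1:k} from the summary, and the token index i is introduced here.

\[ D_{\text{KL}}\big(p(x_{1:k}) \parallel q(x_{1:k})\big) = \sum_{i=1}^{k} \mathbb{E}_{p(x_{1:i-1})}\Big[ D_{\text{KL}}\big(p(x_i \mid x_{1:i-1}) \parallel q(x_i \mid x_{1:i-1})\big) \Big] \]

Each conditional term is weighted by the probability p(x_{1:i-1}) of its prefix. A prefix with zero probability under p contributes nothing to the sum, so q's next-token distribution after that prefix can be arbitrary while the total divergence stays zero. A prefix with probability at most δ contributes at most δ times its conditional divergence, so a large disagreement after a rare prefix still yields a near-zero total.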