Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling

Lechao Xiao
2024-09-24
Abstract: The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:
- Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
- Model Comparison at Scale: How to reliably and effectively compare models at the scale where only a single experiment is feasible?
Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper examines the paradigm shift in machine learning from generalization to scaling and asks whether traditional regularization principles remain effective under the new paradigm. It focuses on two core questions:

1. **Principles guiding scaling**:
   - If regularization (reducing overfitting) is no longer the main guiding principle for designing machine learning models, what are the new guiding principles?
   - How can performance be improved by scaling models, especially on large-scale datasets?
2. **Model comparison**:
   - In large-scale scenarios where only one experiment can be carried out, how can the performance of different models be compared reliably and effectively?

### Background and motivation

The traditional machine learning paradigm focuses on generalization: reducing overfitting on limited, small datasets through regularization techniques such as explicit L2 regularization and the implicit regularization provided by small batch sizes and large learning rates. With the success of large-scale language model pretraining (BERT, GPT, and similar models) and the discovery of scaling laws, the paradigm has changed significantly. The new paradigm aims to reduce approximation error by scaling up models rather than to reduce generalization error.

### Main observations and conclusions

1. **Effectiveness of traditional regularization principles**:
   - The paper reports experiments suggesting that traditional regularization techniques (L2 regularization, small batch sizes, large learning rates) may no longer be effective under the scaling paradigm. For example, in large-scale language model pretraining, L2 regularization does not significantly improve test performance, whereas weight decay yields some improvement.
2. **Scaling law crossover phenomenon**:
   - The paper identifies a new phenomenon, "scaling law crossover": the scaling curves of two methods intersect at a certain scale, so a technique that is better at smaller scales may be worse at larger ones. Conclusions drawn from small-scale comparisons therefore cannot be assumed to transfer, and new methods and theory are needed to guide model design and optimization at scale.
3. **New guiding principles**:
   - New guiding principles need to be developed to understand and support model scaling. These may include, but are not limited to, new optimization algorithms, architecture designs, and training strategies.

### Conclusion

By re-examining traditional regularization principles, the paper argues that new guiding principles are needed for model design and optimization under the scaling paradigm. It also emphasizes the difficulty of comparing models at scales where only a single experiment is feasible and identifies the "scaling law crossover" phenomenon as a direction for future research.
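
To make the scaling law crossover concrete, the following is a minimal sketch under stated assumptions, not the paper's own analysis. It assumes each method's loss follows a pure power law in compute, `L(C) = a * C**(-b)`; this functional form, the helper names `power_law_loss` and `crossover_compute`, and all coefficient values are hypothetical and chosen only for illustration.

```python
import math


def power_law_loss(compute, a, b):
    """Loss under an assumed pure power law L(C) = a * C**(-b)."""
    return a * compute ** (-b)


def crossover_compute(a1, b1, a2, b2):
    """Return the compute C* at which two assumed power laws intersect.

    Solving a1 * C**(-b1) == a2 * C**(-b2) gives
    C* = (a1 / a2) ** (1 / (b1 - b2)).
    Returns None when the exponents match, since the curves then never
    cross (or coincide everywhere).
    """
    if math.isclose(b1, b2):
        return None
    return (a1 / a2) ** (1.0 / (b1 - b2))


# Hypothetical coefficients: method A starts lower but improves more slowly.
method_a = (10.0, 0.10)
method_b = (20.0, 0.15)

c_star = crossover_compute(*method_a, *method_b)
print(f"crossover at compute ~ {c_star:.3g}")

# Compare the two methods below and above the crossover point.
for compute in (1e3, 1e9):
    loss_a = power_law_loss(compute, *method_a)
    loss_b = power_law_loss(compute, *method_b)
    better = "A" if loss_a < loss_b else "B"
    print(f"C={compute:.0e}: loss_A={loss_a:.3f}, loss_B={loss_b:.3f}, better={better}")
```

With these made-up coefficients the curves cross near C = 2^20 (about 1.05e6): method A has the lower loss below that point and method B above it. A comparison run only at small scale would therefore pick the method that loses at large scale, which is the model comparison problem raised above.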