Abstract: The remarkable success of large language pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:

- Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
- Model Comparison at Scale: How to reliably and effectively compare models at the scale where only a single experiment is feasible?

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper examines the shift in machine learning from a generalization-centric paradigm to a scaling-centric one, and re-examines whether traditional regularization principles remain effective under the new paradigm. It focuses on two core questions:

1. **Principles guiding scaling**:
   - If regularization (reducing overfitting) is no longer the main guiding principle for designing machine learning models, what new principles take its place?
   - How can performance be improved by scaling models, especially on large-scale datasets?
2. **Model comparison**:
   - At scales where only a single experiment is feasible, how can the performance of different models be compared reliably and effectively?

### Background and motivation

The traditional machine learning paradigm focuses on generalization: reducing overfitting on limited, small datasets through regularization in a broad sense, both explicit (L2 regularization) and implicit (small batch sizes and large learning rates). With the success of large-scale language model pretraining (such as BERT and GPT) and the discovery of scaling laws, this paradigm has changed significantly. The primary objective is now to reduce approximation error by scaling up models, rather than to minimize generalization error.

### Main observations and conclusions

1. **Effectiveness of traditional regularization principles**:
   - The paper presents experiments indicating that traditional regularization techniques (L2 regularization, small batch sizes, large learning rates) may no longer be effective under the scaling paradigm. For example, in large-scale language model pretraining, L2 regularization does not significantly improve test performance, while weight decay yields some improvement.
2. **Scaling law crossover phenomenon**:
   - The paper identifies a new phenomenon, "scaling law crossover", in which the scaling curves of two methods intersect at a certain scale, so a technique that is effective at smaller scales may no longer be effective at larger ones. This indicates that large-scale settings need new methods and theories to guide the design and optimization of models.
3. **New guiding principles**:
   - New guiding principles need to be developed to understand and support model scaling. These may include, but are not limited to, new optimization algorithms, architecture designs, and training strategies.

### Conclusion

By re-examining traditional regularization principles, the paper argues that new guiding principles are needed for model design and optimization under the scaling paradigm. It also emphasizes the difficulty of comparing models at scales where only a single experiment is feasible, and it identifies the "scaling law crossover" phenomenon, pointing to new directions for future research.
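To make the "scaling law crossover" idea concrete, here is a minimal sketch that assumes each method's loss follows a simple power law in compute, L(C) = coeff * C^(-exponent), with no irreducible-loss floor. The functional form, the function names, and the fitted values are all illustrative assumptions, not results or notation from the paper.

```python
import math


def power_law_loss(compute, coeff, exponent, floor=0.0):
    """Loss predicted by the assumed form L(C) = floor + coeff * C**(-exponent)."""
    return floor + coeff * compute ** (-exponent)


def crossover_compute(coeff_a, exp_a, coeff_b, exp_b):
    """Compute budget C* at which two floor-free power laws predict equal loss.

    Solving coeff_a * C**(-exp_a) == coeff_b * C**(-exp_b) gives
        C* = (coeff_a / coeff_b) ** (1 / (exp_a - exp_b)).
    Returns None when the exponents match, since the curves are then
    parallel in log-log space and never intersect.
    """
    if math.isclose(exp_a, exp_b):
        return None
    return (coeff_a / coeff_b) ** (1.0 / (exp_a - exp_b))


# Hypothetical fits (NOT values from the paper).
method_a = {"coeff": 10.0, "exponent": 0.10}  # lower loss at small scale
method_b = {"coeff": 20.0, "exponent": 0.15}  # steeper curve, wins at large scale

c_star = crossover_compute(
    method_a["coeff"], method_a["exponent"],
    method_b["coeff"], method_b["exponent"],
)
print(f"crossover at C* ~ {c_star:.3e}")

# Compare the two methods on either side of the crossover.
for compute in (1e3, 1e9):
    loss_a = power_law_loss(compute, **method_a)
    loss_b = power_law_loss(compute, **method_b)
    better = "A" if loss_a < loss_b else "B"
    print(f"C={compute:.0e}: loss_A={loss_a:.3f}, loss_B={loss_b:.3f}, better={better}")
```

Under these assumed fits, method A looks better in every experiment run below the crossover point, yet method B is the better choice above it. This is why a ranking established at small scale cannot simply be extrapolated to a scale where only one run is affordable.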

Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
