Abstract: The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice \citep{Belkin2018ReconcilingMM}. Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.
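
The link between the condition number and the minimum an optimizer reaches can be made concrete in the simplest case. The following is a sketch for plain least squares with full-batch gradient descent, which is an assumption chosen for illustration; the abstract itself gives no formulas, and the symbols $\Phi$, $H$, $\kappa$, $\eta$, and $\theta^\star$ are introduced here rather than taken from the paper.

```latex
% Least-squares loss on a feature matrix \Phi \in \mathbb{R}^{n \times p}
L(\theta) = \frac{1}{2n}\,\lVert \Phi\theta - y \rVert_2^2,
\qquad
H = \frac{1}{n}\,\Phi^\top \Phi,
\qquad
\kappa = \frac{\lambda_{\max}(H)}{\lambda_{\min}^{+}(H)}

% Gradient descent with step size \eta = 1/\lambda_{\max}(H):
% for any minimizer \theta^\star and any eigenpair (\lambda_i, v_i) of H,
v_i^\top(\theta_t - \theta^\star)
  = \left(1 - \frac{\lambda_i}{\lambda_{\max}(H)}\right)^{t}
    v_i^\top(\theta_0 - \theta^\star)

% The slowest direction with \lambda_i > 0 contracts by (1 - 1/\kappa) per step,
% so reaching relative error \varepsilon there takes at most about
t \;\ge\; \kappa \,\ln(1/\varepsilon) \quad \text{steps.}
```

Here $\lambda_{\min}^{+}(H)$ denotes the smallest nonzero eigenvalue. Under these assumptions, a large $\kappa$ means that a fixed training budget leaves the low-eigenvalue directions largely unfitted, so the optimizer stops short of the lowest-loss minimum.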

What problem does this paper attempt to address?

This paper seeks to understand the role of optimization in the double descent phenomenon and to explain why double descent is sometimes weak or absent altogether. Specifically, it addresses the following points:

1. **Background of the double descent phenomenon**:
   - Model-wise double descent refers to test error that first decreases, then rises to a peak, and then decreases again as model size increases. This behavior challenges the traditional understanding of generalization, especially for over-parameterized models.
   - Although double descent has been observed across a variety of tasks and architectures, its peak is sometimes diminished or absent in practice, even without explicit regularization such as weight decay or early stopping.
2. **Research objectives**:
   - Explain, from the perspective of optimization, why double descent sometimes appears only weakly or does not occur.
   - Give a unified account of the factors that affect double descent (initialization, normalization, batch size, learning rate, optimization algorithm), and explain how they change the minimum the optimizer finally reaches through the condition number of the optimization problem or of the optimizer.
3. **Main contributions**:
   - Propose a simple optimization-based explanation: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum.
   - Demonstrate that these factors affect the minimum found by changing the condition number of the optimization problem or of the optimizer, thereby raising or lowering the double descent peak.
   - Support this unified view with a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings.
4. **Practical significance**:
   - The results suggest that double descent is unlikely to be a problem for real-world machine learning setups, since under the paper's account a pronounced peak appears only when the optimizer reaches a sufficiently low-loss minimum.
   - The findings help explain the gap between the weak double descent peaks seen in practice and the strong peaks observable in carefully designed setups.

In summary, the paper examines double descent from the perspective of optimization and offers a unified account of when its peak appears and when it does not.
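
The claim that the peak depends on whether the optimizer reaches a low-loss minimum can be explored with a small random-feature experiment. The sketch below is not the paper's code: the ReLU feature map, the data model, the widths, the step budget, and all function names (`relu_features`, `gradient_descent`, `run`) are assumptions made for illustration. It compares the minimum-norm least-squares solution, which stands in for an optimizer that reaches the lowest training loss, with full-batch gradient descent stopped after a fixed number of steps.

```python
import numpy as np


def relu_features(X, W):
    """Random ReLU features phi(x) = max(0, W x) / sqrt(p) (illustrative choice)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])


def mse(Phi, theta, y):
    """Mean squared error of the linear readout theta on features Phi."""
    return float(np.mean((Phi @ theta - y) ** 2))


def gradient_descent(Phi, y, steps):
    """Full-batch gradient descent on (1/2n)||Phi theta - y||^2 from zero init."""
    n, p = Phi.shape
    H = Phi.T @ Phi / n                      # Hessian of the loss
    lr = 1.0 / np.linalg.eigvalsh(H)[-1]     # step size 1 / lambda_max(H)
    theta = np.zeros(p)
    for _ in range(steps):
        theta -= lr * (Phi.T @ (Phi @ theta - y) / n)
    return theta


def run(n_train=40, n_test=500, d=10, widths=(5, 10, 20, 40, 80, 160),
        gd_steps=200, noise=0.5, seed=0):
    """Sweep the number of random features p and record both solutions."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d) / np.sqrt(d)          # hypothetical linear teacher
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ w_star + noise * rng.normal(size=n_train)  # noisy train labels
    y_te = X_te @ w_star                                     # noiseless test labels
    results = {}
    for p in widths:
        W = rng.normal(size=(p, d))
        Phi_tr = relu_features(X_tr, W)
        Phi_te = relu_features(X_te, W)
        # lstsq returns the minimum-norm least-squares solution.
        theta_mn = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]
        theta_gd = gradient_descent(Phi_tr, y_tr, gd_steps)
        s = np.linalg.svd(Phi_tr, compute_uv=False)
        results[p] = {
            "cond": float(s[0] / s[-1]),              # condition number of Phi_tr
            "train_minnorm": mse(Phi_tr, theta_mn, y_tr),
            "train_gd": mse(Phi_tr, theta_gd, y_tr),
            "test_minnorm": mse(Phi_te, theta_mn, y_te),
            "test_gd": mse(Phi_te, theta_gd, y_te),
            "norm_minnorm": float(np.linalg.norm(theta_mn)),
            "norm_gd": float(np.linalg.norm(theta_gd)),
        }
    return results


if __name__ == "__main__":
    for p, r in run().items():
        print(f"p={p:4d}  cond={r['cond']:10.1f}  "
              f"test(min-norm)={r['test_minnorm']:10.3f}  "
              f"test(GD)={r['test_gd']:10.3f}")
```

In runs of this kind one would expect the feature matrix to be worst conditioned near `p = n_train`, with the minimum-norm solution showing its largest test error there, while the step-limited gradient descent run stays further from that minimum. The exact numbers depend on the seed and on the assumed settings, so the output should be read as an illustration of the mechanism rather than a reproduction of the paper's experiments.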

Understanding the Role of Optimization in Double Descent

Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

Can we avoid Double Descent in Deep Neural Networks?

Understanding the Double Descent Phenomenon in Deep Learning

Phenomenology of Double Descent in Finite-Width Neural Networks

Manipulating Sparse Double Descent

Multi-scale Feature Learning Dynamics: Insights for Double Descent

Regularization-wise double descent: Why it occurs and how to eliminate it

Optimization Variance: Exploring Generalization Properties of DNNs

Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant

Sparse Double Descent: Where Network Pruning Aggravates Overfitting

When and how epochwise double descent happens

Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms

Towards Understanding Epoch-wise Double descent in Two-layer Linear Neural Networks

Enhancing Deep Learning with Optimized Gradient Descent: Bridging Numerical Methods and Neural Network Training

Least Squares Regression Can Exhibit Under-Parameterized Double Descent

Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space

The Multiscale Structure of Neural Network Loss Functions: The Effect on Optimization and Origin

A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning

Optimization for deep learning: theory and algorithms

Empirical Tests of Optimization Assumptions in Deep Learning