Abstract: Model-wise double descent, in which the test error peaks and then decreases as model size increases, has attracted attention because of the striking gap between theory and practice \citep{Belkin2018ReconcilingMM}. Although double descent has been observed across many tasks and architectures, its peak is sometimes noticeably diminished or absent, even without explicit regularization such as weight decay and early stopping. In this paper, we investigate this phenomenon from an optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, and optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer, and thus the final minimum that the optimizer finds, which lowers or raises the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings to demonstrate this unified view. Our results suggest that double descent is unlikely to be a problem in real-world machine learning setups, and they help explain the gap between the weak double descent peaks seen in practice and the strong peaks observable in carefully designed setups.
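The central claim can be illustrated with a small numerical sketch. It is not the paper's experimental setup: the ReLU random-feature regression, the function name `random_feature_experiment`, and every parameter value below are assumptions chosen only for illustration. For each model width, the sketch compares a well-optimized solution (minimum-norm least squares) with gradient descent stopped after a fixed step budget, and reports the condition number of the feature matrix.

```python
import numpy as np

def random_feature_experiment(n_train=40, n_test=500, d=20,
                              widths=(10, 40, 400), gd_steps=200,
                              noise=0.5, seed=0):
    """Hypothetical toy setup: ReLU random features with a linear teacher."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d) / np.sqrt(d)
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    y_tr = X_tr @ beta + noise * rng.normal(size=n_train)  # noisy labels
    y_te = X_te @ beta                                     # clean targets

    results = {}
    for p in widths:
        # Fixed random first layer; only the linear readout w is trained.
        W = rng.normal(size=(d, p)) / np.sqrt(d)
        F_tr = np.maximum(X_tr @ W, 0.0)
        F_te = np.maximum(X_te @ W, 0.0)

        # "Optimizer reaches a low-loss minimum": least squares, which is
        # the minimum-norm interpolant once p >= n_train.
        w_exact = np.linalg.lstsq(F_tr, y_tr, rcond=None)[0]

        # "Optimizer stops short": gradient descent on 0.5 * ||F w - y||^2
        # from zero, with step size 1/L where L is the largest eigenvalue
        # of F^T F, run for a fixed number of steps.
        L = np.linalg.norm(F_tr, 2) ** 2
        w_gd = np.zeros(p)
        for _ in range(gd_steps):
            w_gd -= (F_tr.T @ (F_tr @ w_gd - y_tr)) / L

        def mse(F, y, w):
            return float(np.mean((F @ w - y) ** 2))

        results[p] = {
            "cond": float(np.linalg.cond(F_tr)),
            "train_mse_exact": mse(F_tr, y_tr, w_exact),
            "train_mse_gd": mse(F_tr, y_tr, w_gd),
            "test_mse_exact": mse(F_te, y_te, w_exact),
            "test_mse_gd": mse(F_te, y_te, w_gd),
            "norm_exact": float(np.linalg.norm(w_exact)),
            "norm_gd": float(np.linalg.norm(w_gd)),
        }
    return results

if __name__ == "__main__":
    for p, r in random_feature_experiment().items():
        print(p, {k: round(v, 4) for k, v in r.items()})
```

In runs of this kind, the feature matrix is typically worst conditioned near the interpolation threshold (width equal to the number of training points), which is where a budget-limited optimizer falls furthest short of the low-loss minimum. The gradient descent readout never has a larger norm than the least-squares one, which is one way to see why a weaker optimizer can flatten the peak. The exact numbers depend on the seed and the assumed parameters.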

When and how epochwise double descent happens

Towards Understanding Epoch-wise Double descent in Two-layer Linear Neural Networks

Multi-scale Feature Learning Dynamics: Insights for Double Descent

Can we avoid Double Descent in Deep Neural Networks?

Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

Regularization-wise double descent: Why it occurs and how to eliminate it

Understanding the Double Descent Phenomenon in Deep Learning

Phenomenology of Double Descent in Finite-Width Neural Networks

Understanding the Role of Optimization in Double Descent

Manipulating Sparse Double Descent

Dropout Drops Double Descent

Double Descent of Discrepancy: A Task-, Data-, and Model-Agnostic Phenomenon

DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?

Optimization Variance: Exploring Generalization Properties of DNNs

Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space

The twin peaks of learning neural networks

Sparse Double Descent: Where Network Pruning Aggravates Overfitting

On Multi-Stage Loss Dynamics in Neural Networks: Mechanisms of Plateau and Descent Stages

Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies

Learning Stages: Phenomenon, Root Cause, Mechanism Hypothesis, and Implications