Understanding the Role of Optimization in Double Descent

Chris Yuhao Liu, Jeffrey Flanigan
2023-12-07
Abstract: The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice \citep{Belkin2018ReconcilingMM}. Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.
Machine Learning
What problem does this paper attempt to address?
This paper seeks to understand the role of optimization in model-wise double descent and to explain why the phenomenon is sometimes weak or absent. Specifically, it covers the following:

1. **Background of the double-descent phenomenon**:
   - Model-wise double descent refers to the test error first decreasing, then rising to a peak, and then decreasing again as the model size increases. This behavior challenges the traditional understanding of generalization, especially for over-parameterized models.
   - Although double descent has been observed across a variety of tasks and architectures, the peak is sometimes diminished or absent in practice, even without explicit regularization such as weight decay or early stopping.
2. **Research objectives**:
   - Explain, from the perspective of optimization, why double descent sometimes appears weakly or not at all.
   - Give a unified account of the factors that affect double descent (initialization, normalization, batch size, learning rate, and optimization algorithm), and explain how they change the minimum the optimizer finally reaches by changing the condition number of the optimization problem or of the optimizer.
3. **Main contributions**:
   - Propose a simple optimization-based explanation: model-wise double descent is observed if and only if the optimizer can find a minimum with sufficiently low loss.
   - Demonstrate empirically that these factors act through the condition number of the optimization problem or of the optimizer, which determines the minimum that is reached and thereby raises or lowers the double-descent peak.
   - Support this unified view with a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings.
4. **Practical significance**:
   - The results suggest that double descent is unlikely to be a problem in real-world machine-learning setups since, on the paper's account, a pronounced peak appears only when the optimizer reaches a sufficiently low-loss minimum.
   - These findings help explain the gap between the weak double-descent peaks seen in practice and the strong peaks observable in carefully designed setups.

In summary, the paper studies double descent from the perspective of optimization and argues that its findings bear on how the phenomenon shows up in real-world training setups.
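
The link between the condition number and the minimum an optimizer reaches can be illustrated with a toy problem. The sketch below is not taken from the paper and does not reproduce its experiments or a double-descent curve: it assumes a diagonal quadratic loss, plain gradient descent, and a fixed step budget, and the names `gd_final_loss` and `condition_number` are hypothetical. It only shows that, with the same number of steps, the ill-conditioned problem is left at a much higher loss than the well-conditioned one.

```python
import math


def condition_number(eigenvalues):
    """Ratio of the largest to the smallest curvature of the quadratic."""
    return max(eigenvalues) / min(eigenvalues)


def gd_final_loss(eigenvalues, steps=200):
    """Gradient descent on f(w) = 0.5 * sum(lam_i * w_i**2).

    Assumptions (illustrative only): step size 1 / max(eigenvalues), and a
    starting point chosen so every direction contributes 0.5 to the
    initial loss. Returns the loss after `steps` iterations.
    """
    lr = 1.0 / max(eigenvalues)
    w = [1.0 / math.sqrt(lam) for lam in eigenvalues]
    for _ in range(steps):
        # The gradient of the quadratic in coordinate i is lam_i * w_i.
        w = [wi - lr * lam * wi for wi, lam in zip(w, eigenvalues)]
    return 0.5 * sum(lam * wi * wi for lam, wi in zip(eigenvalues, w))


well_conditioned = [1.0, 0.5]   # condition number 2
ill_conditioned = [1.0, 1e-3]   # condition number 1000

loss_well = gd_final_loss(well_conditioned)
loss_ill = gd_final_loss(ill_conditioned)

print(f"kappa={condition_number(well_conditioned):.0f}  final loss={loss_well:.3e}")
print(f"kappa={condition_number(ill_conditioned):.0f}  final loss={loss_ill:.3e}")
```

Both problems start from a loss of 1.0. The direction with curvature `1e-3` shrinks by a factor of only `0.999` per step, so it is barely fitted within the budget. In the paper's framing, factors such as initialization, normalization, batch size, or learning rate would change this effective conditioning and therefore how low a loss the optimizer reaches.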