Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Rodrigo Veiga, Ludovic Stephan, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
2023-06-14
Abstract: Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
Machine Learning, Disordered Systems and Neural Networks
What problem does this paper attempt to address?
This paper studies how two-layer neural networks trained with Stochastic Gradient Descent (SGD) behave when the input data is high-dimensional. Specifically, it examines how the way the learning rate \(\gamma\) and the hidden-layer width \(p\) scale with the input dimension \(d\) affects the performance of SGD in the limit \(d \to \infty\). The regimes below are stated in terms of the exponents \(\kappa\) and \(\delta\) that parametrize these scalings. The paper derives a set of deterministic ordinary differential equations (ODEs) that describe the SGD dynamics and uses them to identify four learning scenarios:

1. **Perfect learning (green region, \(\kappa > -\delta\))**: When the hidden-layer width \(p\) and the learning rate \(\gamma\) satisfy this condition, perfect learning (zero overall risk) can be achieved asymptotically, even if the task contains additive noise. The required number of samples is \(n \sim d^{1+\kappa+\delta}\).
2. **Plateau (blue line, \(\kappa = -\delta\))**: Learning reaches a plateau whose level is set by the noise intensity. When \(\kappa = \delta = 0\), the setting reduces to that of the classical works.
3. **Poor learning (orange region, \(-1/2 < \kappa+\delta < 0\))**: Noise dominates the learning process, and the network learns poorly.
4. **No ODEs (red region, \(\kappa+\delta < -1/2\))**: The stochastic process associated with SGD is not guaranteed to converge to a set of deterministic ODEs, so this region lies outside the scope of the paper's analysis.

The main contributions of the paper are:

- **C1**: Rigorously prove that the dynamics of SGD can be captured by a set of deterministic ODEs, extending previous work [6] to allow more general learning rates and time scales, as well as a wider range of hidden-layer widths.
- **C2**: Based on the analysis of the ODEs, provide a phase diagram of SGD for two-layer neural networks in the limit of a high-dimensional input layer, describing the possible learning scenarios.

In addition, the paper discusses how the initialization affects the high-dimensional dynamics, in particular the relationship between the unspecialized regime and the specialization transition. These analyses help in understanding the learning behavior and optimization of two-layer neural networks on high-dimensional data.
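
The four regions of the phase diagram summarized above depend only on the sum \(\kappa+\delta\), so they can be written as a small lookup. The following is a minimal sketch based solely on the thresholds quoted in this summary, not code from the paper. The function names `classify_regime` and `sample_exponent` are hypothetical, and the behaviour exactly at \(\kappa+\delta = -1/2\) is left unspecified because the summary does not state it.

```python
import math


def classify_regime(kappa: float, delta: float, tol: float = 1e-12) -> str:
    """Map the exponents (kappa, delta) to the regime named in the summary.

    Only the sum kappa + delta matters for the thresholds quoted above.
    """
    s = kappa + delta
    if math.isclose(s, 0.0, abs_tol=tol):
        return "plateau"                      # blue line: kappa = -delta
    if s > 0:
        return "perfect learning"             # green region: kappa > -delta
    if math.isclose(s, -0.5, abs_tol=tol):
        # The summary does not say which side this boundary belongs to.
        return "boundary (not specified in the summary)"
    if s > -0.5:
        return "poor learning"                # orange region
    return "no ODE description"               # red region


def sample_exponent(kappa: float, delta: float) -> float:
    """Exponent a in n ~ d**a, quoted for the perfect-learning region."""
    return 1.0 + kappa + delta


if __name__ == "__main__":
    # Illustrative exponent pairs, chosen here only to hit each region.
    for kappa, delta in [(0.5, 0.0), (0.0, 0.0), (0.25, -0.5), (-0.5, -0.5)]:
        print(kappa, delta, classify_regime(kappa, delta))
```

For example, `classify_regime(0.0, 0.0)` falls on the plateau line, which is the case the summary identifies with the classical works. The tolerance `tol` is a design choice for comparing floating-point exponents with the exact boundaries.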