Abstract: Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.

What problem does this paper attempt to address?

This paper studies how two-layer neural networks trained with Stochastic Gradient Descent (SGD) behave when the input dimension is high (\(d \to \infty\)). Specifically, it asks how the way the learning rate \(\gamma\) and the hidden-layer width \(p\) scale with \(d\) affects the performance of SGD. The paper derives a set of deterministic ordinary differential equations (ODEs) that describe the SGD dynamics and, based on them, analyzes four learning scenarios:

1. **Perfect learning (green region, \(\kappa > -\delta\))**: When the hidden-layer width \(p\) and the learning rate \(\gamma\) satisfy this condition, perfect learning (zero overall risk) can be achieved asymptotically even if the task contains additive noise. The required number of samples is \(n \sim d^{1+\kappa+\delta}\).
2. **Plateau (blue line, \(\kappa = -\delta\))**: Learning reaches a plateau whose level is set by the noise intensity. For \(\kappa = \delta = 0\), this reduces to the setting of the classical works.
3. **Poor learning (orange region, \(-1/2 < \kappa+\delta < 0\))**: Noise dominates the learning process, and the network learns poorly.
4. **No ODEs (red region, \(\kappa+\delta < -1/2\))**: The stochastic process associated with SGD is not guaranteed to converge to a set of deterministic ODEs, so this region falls outside the scope of the paper's analysis.

The main contributions of the paper are:

- **C1**: A rigorous proof that the dynamics of SGD can be captured by a set of deterministic ODEs, extending previous work [6] to allow more general learning rates and time scales, as well as a wider range of hidden-layer widths.
- **C2**: Based on the analysis of these ODEs, a phase diagram of SGD for two-layer neural networks with a high-dimensional input layer, describing the possible learning scenarios.

The paper also discusses how the initialization affects the high-dimensional dynamics, in particular the relationship between the unspecialized regime and the specialization transition. These analyses help explain the learning behavior and optimization process of two-layer neural networks on high-dimensional data.
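
The four scenarios listed above depend only on the sum \(\kappa+\delta\), so the phase diagram can be sketched as a small lookup. The following Python snippet is a minimal illustration under stated assumptions, not code from the paper: the function names `sgd_regime` and `sample_scaling` and the tolerance `tol` are hypothetical, and the boundary \(\kappa+\delta = -1/2\) is reported separately because the summary does not assign it to a region. The sample-count helper only evaluates the order of magnitude \(d^{1+\kappa+\delta}\) quoted for the perfect-learning regime, with all constants dropped.

```python
def sgd_regime(kappa: float, delta: float, tol: float = 1e-12) -> str:
    """Map the exponents (kappa, delta) to one of the regimes in the summary.

    `tol` is a hypothetical numerical tolerance used to detect the
    plateau line kappa = -delta in floating-point arithmetic.
    """
    s = kappa + delta
    if s > tol:
        return "perfect learning"   # green region: kappa > -delta
    if abs(s) <= tol:
        return "plateau"            # blue line: kappa = -delta
    if s > -0.5 + tol:
        return "poor learning"      # orange region: -1/2 < kappa + delta < 0
    if s < -0.5 - tol:
        return "no ODEs"            # red region: kappa + delta < -1/2
    # The summary does not say which region the line kappa + delta = -1/2 belongs to.
    return "boundary (unspecified)"


def sample_scaling(d: int, kappa: float, delta: float) -> float:
    """Order of magnitude d**(1 + kappa + delta) of the sample count n.

    Constants are ignored; the summary states this scaling for the
    perfect-learning regime.
    """
    return float(d) ** (1.0 + kappa + delta)


if __name__ == "__main__":
    for kappa, delta in [(0.5, 0.25), (0.0, 0.0), (-0.25, 0.0), (-1.0, 0.0)]:
        print(kappa, delta, sgd_regime(kappa, delta))
```

For instance, `sgd_regime(0.0, 0.0)` lands on the plateau line, matching the remark that \(\kappa = \delta = 0\) corresponds to the classical setting.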

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Stochastic Gradient Descent for Two-layer Neural Networks

On the different regimes of Stochastic Gradient Descent

Convergence of stochastic gradient descent under a local Lojasiewicz condition for deep neural networks

Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

Learning Time-Scales in Two-Layers Neural Networks

Convergence of Stochastic Gradient Descent in Deep Neural Network

Novel Convergence Results of Adaptive Stochastic Gradient Descents

Global Convergence Analysis of Local SGD for Two-layer Neural Network Without Overparameterization

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

On the convergence of gradient descent for two layer neural networks

Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape

Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

On the Unstable Convergence Regime of Gradient Descent

Stochastic Gradient Descent and Anomaly of Variance-flatness Relation in Artificial Neural Networks

On the Convergence of Gradient Descent for Large Learning Rates