Abstract: Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem. We first show that saddle points in neural networks can be divided into two types, among which the Type-II saddles are especially difficult to escape from because the gradient noise vanishes at the saddle. The dynamics of SGD around these saddles are thus to leading order described by a random matrix product process, and it is thus natural to study the dynamics of SGD around these saddles using the notion of probabilistic stability and the related Lyapunov exponent. Theoretically, we link the study of SGD dynamics to well-known concepts in ergodic theory, which we leverage to show that saddle points can be either attractive or repulsive for SGD, and its dynamics can be classified into four different phases, depending on the signal-to-noise ratio in the gradient close to the saddle.
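
One way to read "random matrix product process" is sketched below. The notation is an assumption and does not come from the abstract: $\theta_t$ is the displacement from the saddle, $\lambda$ the learning rate, and $\hat{H}_t$ the Hessian of the minibatch loss at step $t$. If every minibatch gradient vanishes at the saddle, linearizing the SGD update around it would give:

```latex
\theta_{t+1} = (I - \lambda \hat{H}_t)\,\theta_t
\quad\Longrightarrow\quad
\theta_t = (I - \lambda \hat{H}_{t-1}) \cdots (I - \lambda \hat{H}_0)\,\theta_0,
\qquad
\Lambda = \lim_{t\to\infty} \frac{1}{t}\,\log \lVert \theta_t \rVert .
```

Under this reading, a negative Lyapunov exponent $\Lambda$ means the iterates contract toward the saddle (it is attractive), and a positive $\Lambda$ means they escape (it is repulsive).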

What problem does this paper attempt to address?

The paper asks how to understand and characterize the dynamics of stochastic gradient descent (SGD) near saddle points when optimizing neural networks. It observes that saddle points fall into two types, and that Type-II saddle points are hard for SGD to escape because the gradient noise vanishes at the saddle. Its main contributions are:

1. It divides the saddle points of neural networks into two types: Type-I saddle points, where the gradient noise does not vanish at the saddle, and Type-II saddle points, where it does.
2. It uses probabilistic stability and the Lyapunov exponent to study the dynamics around Type-II saddle points, bringing concepts from the control theory of dynamical systems and from ergodic theory into the study of SGD.
3. Through a probabilistic stability analysis, it proves that SGD has at least four different learning phases near Type-II saddle points, a result relevant to understanding neural network initialization.

### Abstract of the paper

The paper first points out that saddle points in neural networks can be divided into two types, and that Type-II saddle points are particularly difficult to escape because the gradient noise vanishes at the saddle. To leading order, the dynamics of SGD near these saddle points are described by a random matrix product process, so it is natural to study them using probabilistic stability and the related Lyapunov exponent. Theoretically, the paper links the study of SGD dynamics to concepts in ergodic theory and proves that saddle points can be either attractive or repulsive for SGD, and that the dynamics can be divided into four different phases according to the signal-to-noise ratio in the gradient near the saddle.

### Main contributions

1. **Saddle point classification**: The paper divides the saddle points in neural networks into two types and shows that Type-II saddle points are difficult to escape.
2. **Probabilistic stability and Lyapunov exponent**: The paper uses probabilistic stability and the Lyapunov exponent to study the attractiveness of Type-II saddle points, applying the control theory of dynamical systems and ergodic theory to the study of SGD.
3. **Learning phases**: Through a probabilistic stability analysis, the paper shows that SGD has at least four different learning phases near Type-II saddle points, a result relevant to understanding neural network initialization.

### Research background

Saddle points are common in the loss functions of neural networks, and understanding the behavior of SGD near them is an important problem in the theory of deep learning. The paper pays special attention to Type-II saddle points, which SGD has difficulty escaping because the gradient noise vanishes at the saddle. By introducing probabilistic stability and the Lyapunov exponent, the paper provides a new method for studying the dynamics around these saddle points.

### Experimental results

The paper checks the theoretical analysis against experiments. The results show that SGD behaves differently at different learning rates: correct learning, incorrect learning, convergence to low-rank saddle points, and complete instability. These results are consistent with the theoretical predictions and support the probabilistic stability framework as a way to understand the dynamics of SGD.

In conclusion, by introducing probabilistic stability and the Lyapunov exponent, the paper offers a new perspective on the behavior of SGD near saddle points in neural networks, one that may inform improvements to deep learning algorithms and optimization strategies.
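
The attractive-versus-repulsive distinction can be illustrated with a one-dimensional toy model. This is a minimal sketch under stated assumptions, not the paper's experiment. It assumes a scalar update `theta <- (1 - lr * h_t) * theta`, where the minibatch curvature `h_t` is drawn from a Gaussian; the Gaussian choice and the names `lyapunov_exponent_1d`, `simulate_log_distance`, `mean_h`, and `std_h` are all hypothetical. Because the update is proportional to `theta`, the noise vanishes at `theta = 0`, which is what makes the saddle Type-II in the summary's sense. In this scalar case the Lyapunov exponent reduces to `E[log|1 - lr * h|]`.

```python
import numpy as np


def lyapunov_exponent_1d(lr, mean_h, std_h, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[log|1 - lr*h|] with h ~ N(mean_h, std_h^2).

    The Gaussian curvature model is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    h = rng.normal(mean_h, std_h, size=n_samples)
    return float(np.mean(np.log(np.abs(1.0 - lr * h))))


def simulate_log_distance(lr, mean_h, std_h, n_steps=2000, seed=1):
    """Return log(|theta_T| / |theta_0|) for theta <- (1 - lr*h_t) * theta.

    Working in log space avoids overflow or underflow of theta itself.
    """
    rng = np.random.default_rng(seed)
    h = rng.normal(mean_h, std_h, size=n_steps)
    return float(np.sum(np.log(np.abs(1.0 - lr * h))))


if __name__ == "__main__":
    # mean_h < 0: the full-batch loss has negative curvature along this
    # direction, so plain gradient descent would escape the saddle.
    for std_h in (0.1, 1.0):
        lam = lyapunov_exponent_1d(lr=1.0, mean_h=-0.1, std_h=std_h)
        drift = simulate_log_distance(lr=1.0, mean_h=-0.1, std_h=std_h)
        label = "repulsive" if lam > 0 else "attractive"
        print(f"std_h={std_h}: exponent={lam:+.3f} ({label}), "
              f"log distance change={drift:+.1f}")
```

In this toy model the mean curvature is the same in both runs, yet with small curvature noise the exponent is positive and the iterate moves away from the saddle, while with large noise the exponent is negative and the iterate collapses onto it. That matches the summary's point that the outcome depends on the signal-to-noise ratio in the gradient, not on the curvature of the averaged loss alone.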

Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the Bounded Gradient Assumption

High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

Beyond the Edge of Stability via Two-step Gradient Updates

Stochastic Subgradient Descent Escapes Active Strict Saddles on Weakly Convex Functions

A Central Limit Theorem for Algorithmic Estimator of Saddle Point

Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent

Dealing with unbounded gradients in stochastic saddle-point optimization

A Deterministic Gradient-Based Approach to Avoid Saddle Points

Revisiting the Noise Model of Stochastic Gradient Descent

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

A Theoretical Analysis of Noise Geometry in Stochastic Gradient Descent

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

A Precise Characterization of SGD Stability Using Loss Surface Geometry

Characterizing Dynamical Stability of Stochastic Gradient Descent in Overparameterized Learning

Escaping Saddle Points with Stochastically Controlled Stochastic Gradient Methods

Stochastic Gradient and Langevin Processes

Exact Mean Square Linear Stability Analysis for SGD