Abstract: Stochastic gradient descent (SGD) has become the most attractive optimization method for training large-scale deep neural networks due to its simplicity, low computational cost per update step, and good performance. Standard excess risk bounds show that SGD needs only one pass over the training data and that additional passes do not improve performance. Empirically, however, SGD taking more than one pass over the training data (multi-pass SGD) has been observed to achieve much better excess risk than SGD taking only one pass (one-pass SGD). It is not clear how to explain this phenomenon in theory. In this paper, we provide theoretical evidence explaining why multiple passes over the training data can help improve performance under certain circumstances. Specifically, we consider smooth risk minimization problems whose objective function is a non-convex least-squares loss. Under the Polyak-Łojasiewicz (PL) condition, we establish a faster convergence rate of the excess risk bound for multi-pass SGD than for one-pass SGD.

What problem does this paper attempt to address?

The paper asks why, when training deep neural networks, multi-pass (multi-epoch) stochastic gradient descent (SGD) generalizes better than one-pass SGD. Standard theoretical analysis shows that SGD needs to traverse the training data only once to achieve optimal performance, yet in practice researchers have found that traversing the training data multiple times (multi-pass SGD) can significantly reduce the test error and so improve the generalization ability of the model.

### Main contributions of the paper

1. **Theoretical explanation**: By introducing the Polyak-Łojasiewicz (PL) condition, the paper provides a theoretical basis for the faster convergence of multi-pass SGD in non-convex optimization problems.
2. **Convergence rate**: Under the PL condition, the paper proves that multi-pass SGD continues to improve model performance in later epochs, and that its generalization error bound can reach $\tilde{O}\left(\frac{1}{n^2}\right)$, a faster rate than the $\tilde{O}\left(\frac{1}{n}\right)$ of one-pass SGD.
3. **Combination of experimental observations and theory**: The paper combines practical experience from training neural networks with theoretical analysis to explain why multi-pass SGD performs better in practice.

### Specific problem description

The paper focuses on stochastic optimization problems of the form

\[
\min_{w \in \mathbb{R}^d} F_S(w) := \frac{1}{n} \sum_{i = 1}^n \ell(w; z_i),
\]

where $w$ is the model parameter, $S=\{z_1, \ldots, z_n\}$ is the training sample set drawn from the distribution $P$, and $\ell(w; z)$ is a non-negative smooth loss function. Problems of this form arise in most machine-learning optimization tasks, such as empirical risk minimization (ERM) and deep learning.

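
To make the terms "pass" and "empirical risk" concrete, the following is a minimal sketch of one-pass versus multi-pass SGD on the finite-sum objective $F_S(w)$. It is not the paper's algorithm or experimental setup: the one-dimensional squared loss, the noiseless data `y = 2 * x`, the step size, and all function names are assumptions chosen so that the run is easy to check by hand. The toy problem is convex and only illustrates the procedure, not the rates claimed in the paper.

```python
import random

def loss(w, z):
    """Squared loss l(w; z) = 0.5 * (w * x - y)^2 for a sample z = (x, y)."""
    x, y = z
    return 0.5 * (w * x - y) ** 2

def grad(w, z):
    """Derivative of loss(w, z) with respect to w."""
    x, y = z
    return (w * x - y) * x

def empirical_risk(w, samples):
    """F_S(w) = (1/n) * sum_i l(w; z_i)."""
    return sum(loss(w, z) for z in samples) / len(samples)

def sgd(samples, passes, step_size=0.5, w0=0.0, seed=0):
    """Run `passes` epochs of SGD, visiting every sample once per epoch."""
    rng = random.Random(seed)
    w = w0
    order = list(range(len(samples)))
    for _ in range(passes):
        rng.shuffle(order)  # fresh random order for each pass
        for i in order:
            w -= step_size * grad(w, samples[i])
    return w

# Hypothetical noiseless training set: y = 2 * x with x in (0, 1].
samples = [(k / 10, 2 * k / 10) for k in range(1, 11)]

w_one = sgd(samples, passes=1)    # one-pass SGD
w_multi = sgd(samples, passes=5)  # multi-pass SGD

risk_one = empirical_risk(w_one, samples)
risk_multi = empirical_risk(w_multi, samples)
print(f"one-pass empirical risk:   {risk_one:.3e}")
print(f"multi-pass empirical risk: {risk_multi:.3e}")
```

In this noiseless setting every update shrinks the distance to the minimizer `w = 2`, so extra passes can only lower the empirical risk. The paper's question is harder: whether extra passes also lower the excess risk on unseen data when the objective is non-convex.
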
### Research motivation

Although many studies have explored the convergence of SGD under different conditions, most focus on convex optimization or rely on specific assumptions. For non-convex optimization problems such as deep learning, and in particular for the question of why multi-pass SGD performs better in practice, existing theoretical explanations are insufficient. The paper aims to fill this gap and provide a theoretical understanding of the advantage of multi-pass SGD in non-convex optimization.

### Main conclusions

Through rigorous mathematical derivation, the paper proves that in non-convex optimization problems satisfying the PL condition, multi-pass SGD continues to improve model performance in later epochs, and it gives specific generalization error bounds. This explains the behavior observed in actual training and provides theoretical support for the design of future deep-learning optimization algorithms.
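
The summary above relies on the PL condition without stating it. For reference, a commonly used form is sketched below. Writing it for the empirical risk $F_S$ and using the constant $\mu > 0$ with the factor $\frac{1}{2}$ are assumptions here, since the summary does not give the paper's exact statement.

```latex
\[
\frac{1}{2}\,\bigl\|\nabla F_S(w)\bigr\|^2 \;\ge\; \mu \,\Bigl( F_S(w) - \min_{w' \in \mathbb{R}^d} F_S(w') \Bigr)
\quad \text{for all } w \in \mathbb{R}^d .
\]
```

Under this inequality every stationary point of $F_S$ is a global minimizer, and convexity is not required. This is why the condition can be applied to non-convex objectives.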

Why Does Multi-Epoch Training Help?

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Demystifying SGD with Doubly Stochastic Gradients

An Alternative View: When Does SGD Escape Local Minima?

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Accelerated Gradient-free Neural Network Training by Multi-convex Alternating Optimization

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents

The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

On the Generalization of Stochastic Gradient Descent with Momentum

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

Stochastic normalized gradient descent with momentum for large-batch training

Asynchronous Accelerated Stochastic Gradient Descent.

Stochastic Gradient Descent with Biased but Consistent Gradient Estimators

Gradient Descent Optimization in Deep Learning Model Training Based on Multistage and Method Combination Strategy

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

When and Why Momentum Accelerates SGD: An Empirical Study
