Abstract: Stochastic gradient descent (SGD) has become the most attractive optimization method for training large-scale deep neural networks due to its simplicity, low computational cost per update step, and good performance. Standard excess risk bounds show that SGD only needs to take one pass over the training data and that further passes do not improve performance. Empirically, however, SGD taking more than one pass over the training data (multi-pass SGD) has been observed to achieve much better excess risk than SGD taking only one pass (one-pass SGD). It is not clear how to explain this phenomenon in theory. In this paper, we provide theoretical evidence for why multiple passes over the training data can help improve performance under certain circumstances. Specifically, we consider smooth risk minimization problems whose objective function is a non-convex least squared loss. Under the Polyak-Łojasiewicz (PL) condition, we establish a faster convergence rate of the excess risk bound for multi-pass SGD than for one-pass SGD.
What problem does this paper attempt to address?
The paper asks why, when training deep neural networks, multi-pass SGD (also called multi-round or multi-epoch SGD) generalizes better than one-pass SGD. Standard theoretical analysis suggests that a single pass over the training data already suffices and that further passes bring no improvement. In practice, however, researchers have found that traversing the training data multiple times can significantly reduce the test error and thereby improve the generalization ability of the model.
### Main contributions of the paper
1. **Theoretical explanation**: Using the Polyak-Łojasiewicz (PL) condition, the paper provides a theoretical basis for the faster convergence of multi-pass SGD on non-convex optimization problems.
2. **Convergence rate**: When the PL condition holds, the paper proves that multi-pass SGD continues to improve model performance in later epochs, and that its generalization error bound can reach $\tilde{O}\left(\frac{1}{n^2}\right)$, which is faster than the $\tilde{O}\left(\frac{1}{n}\right)$ bound for one-pass SGD.
3. **Combination of experimental observations and theory**: The paper connects practical experience from training neural networks with theoretical analysis to explain why multi-pass SGD performs better in practice.
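For reference, the PL condition is commonly stated as below. This is the standard textbook form rather than a quotation from the paper; the symbols $F$, $\mu$, and $F^*$ are notation assumed here, and the paper may use a different normalization or apply the condition to a specific risk function.
\[
\frac{1}{2} \left\| \nabla F(w) \right\|^2 \ge \mu \left( F(w) - F^* \right) \quad \text{for all } w \in \mathbb{R}^d,
\]
where $\mu > 0$ is a constant and $F^* = \min_{w} F(w)$. The condition does not require $F$ to be convex, but it does imply that every stationary point of $F$ is a global minimizer, which is why it is a common assumption in non-convex analyses.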
### Specific problem description
The paper focuses on the following form of stochastic optimization problem:
\[
\min_{w \in \mathbb{R}^d} F_S(w) := \frac{1}{n} \sum_{i = 1}^n \ell(w; z_i),
\]
where $w$ is the model parameter, $S=\{z_1, \ldots, z_n\}$ is the training sample set drawn from the distribution $P$, and $\ell(w; z)$ is a non-negative smooth loss function. This finite-sum form arises in most machine learning optimization tasks, including empirical risk minimization (ERM) and the training of deep networks.
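To make the difference between one pass and several passes over $S$ concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the paper's algorithm or experiment: the loss is a linear least squares loss (which is convex, unlike the paper's non-convex setting), the data are synthetic, and the names `make_data`, `empirical_risk`, and `sgd`, as well as the step size and sample size, are hypothetical choices. It only reports the training objective $F_S(w)$, not the excess risk.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_data(n, w_true, noise, rng):
    """Synthetic sample S = {z_i = (x_i, y_i)} with y = <w_true, x> + noise."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = dot(w_true, x) + noise * rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data


def empirical_risk(w, data):
    """F_S(w) = (1/n) * sum_i loss(w; z_i) with loss = 0.5 * (<w, x> - y)^2."""
    return sum(0.5 * (dot(w, x) - y) ** 2 for x, y in data) / len(data)


def sgd(data, passes, lr, rng):
    """SGD started at w = 0; each pass visits every sample once in shuffled order."""
    w = [0.0] * len(data[0][0])
    order = list(range(len(data)))
    for _ in range(passes):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            residual = dot(w, x) - y
            # Gradient of the per-sample loss with respect to w is residual * x.
            w = [wj - lr * residual * xj for wj, xj in zip(w, x)]
    return w


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    data = make_data(n=200, w_true=[1.0, -2.0, 0.5], noise=0.1, rng=rng)
    for passes in (1, 10):
        w = sgd(data, passes=passes, lr=0.01, rng=rng)
        print(passes, "pass(es): F_S(w) =", empirical_risk(w, data))
```

With a small constant step size, one pass typically stops before the training objective has converged, and additional passes keep reducing it. Whether the same holds for the excess risk is the question the paper addresses; this sketch does not measure it.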
### Research motivation
Although many studies have explored the convergence of SGD under different conditions, most focus on convex optimization or rely on specific assumptions. For non-convex problems such as deep learning, and in particular for the question of why multi-pass SGD performs better in practice, existing theoretical explanations are insufficient. The paper therefore aims to fill this gap and provide a theoretical understanding of the advantage of multi-pass SGD in non-convex optimization.
### Main conclusions
Through rigorous mathematical derivation, the paper proves that for non-convex optimization problems satisfying the PL condition, multi-pass SGD does continue to improve model performance in later epochs, and it gives explicit generalization error bounds. This explains the behavior observed in actual training and also provides theoretical support for the design of future deep learning optimization algorithms.