Stefano Di Giovacchino, Desmond J. Higham, Konstantinos Zygalakis
Abstract: Stochastic optimization methods have been hugely successful in making large-scale optimization problems feasible when computing the full gradient is computationally prohibitive. Using the theory of modified equations for numerical integrators, we propose a class of stochastic differential equations that approximate the dynamics of general stochastic optimization methods more closely than the original gradient flow. Analyzing a modified stochastic differential equation can reveal qualitative insights about the associated optimization method. Here, we study mean-square stability of the modified equation in the case of stochastic coordinate descent.
What problem does this paper attempt to address?
This paper addresses how to analyze the behavior of stochastic optimization methods, which are used for large-scale problems where computing the full gradient is computationally prohibitive. Using the theory of modified equations for numerical integrators, it proposes a class of stochastic differential equations (SDEs) that approximate the dynamics of general stochastic optimization methods more closely than the original gradient flow. Analyzing a modified SDE can then reveal qualitative properties of the associated optimization method.
### Main contributions of the paper:
1. **Proposition 4.3**: A modified SDE suitable for general stochastic optimization iterations is proposed.
2. **Theorem 4.8**: Conditions are given under which the modified equation for stochastic coordinate descent is mean-square stable, so that its solution converges in mean square to the minimizer.
### Paper structure:
- **Part 1**: Introduction, which introduces the research background and purpose.
- **Part 2**: Preliminaries, covering modified equations for ordinary differential equations (ODEs) and stochastic differential equations (SDEs).
- **Part 3**: Discusses the main ideas of stochastic optimization methods, focusing on two cases: stochastic gradient descent and stochastic coordinate descent.
- **Part 4**: Presents the main results: the modified SDE for general stochastic optimization iterations and the mean-square stability conditions for stochastic coordinate descent.
- **Part 5**: Conclusion, which discusses possible directions for future research.
### Key technical points:
- **Modified equation**: By analyzing the error between the approximate solution generated by the numerical method and the solution of the original equation, a modified SDE is derived that describes the dynamics of the numerical method more accurately than the original equation does.
- **Mean-square stability**: The mean-square stability of the modified SDE is analyzed for the stochastic coordinate descent method, giving conditions that ensure stable convergence.
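The noise term of a modified SDE comes from the randomness in the gradient estimate. As a minimal sketch for stochastic coordinate descent, the block below assumes the estimate is \( d\, U_i \nabla F \) with the index \( i \) drawn uniformly from \( \{1, \dots, d\} \) and \( U_i = e_i e_i^T \); this scaling and choice of \( U_i \) are assumptions made here for illustration, not taken from the paper. The names `scd_estimator`, `scd_step` and `sigma_formula` are hypothetical. Under these assumptions the estimate is unbiased, and its covariance can be compared with the expression for \( \Sigma \) listed under "Specific formulas".

```python
import numpy as np

def scd_estimator(grad, i):
    """Assumed SCD gradient estimate d * U_i * grad, with U_i = e_i e_i^T."""
    d = grad.size
    est = np.zeros(d)
    est[i] = d * grad[i]
    return est

def scd_step(x, grad_fn, h, rng):
    """One SCD iteration with step size h and a uniformly drawn coordinate."""
    i = rng.integers(x.size)
    return x - h * scd_estimator(grad_fn(x), i)

def sigma_formula(grad):
    """d * sum_i U_i g g^T U_i^T - g g^T, written out literally."""
    d = grad.size
    outer = np.outer(grad, grad)
    projections = [np.outer(e, e) for e in np.eye(d)]  # U_i = e_i e_i^T
    return d * sum(U @ outer @ U.T for U in projections) - outer

# Example gradient vector (arbitrary values chosen for illustration).
g = np.array([1.0, -2.0, 0.5])
d = g.size

# Enumerate all d equally likely estimates and average exactly.
ests = np.array([scd_estimator(g, i) for i in range(d)])
mean_est = ests.mean(axis=0)
cov_est = sum(np.outer(e - g, e - g) for e in ests) / d

print("unbiased:", np.allclose(mean_est, g))
print("covariance matches formula:", np.allclose(cov_est, sigma_formula(g)))

# One step on F(x) = 0.5 * |x|^2, whose gradient is x itself.
x1 = scd_step(np.ones(3), lambda x: x, 0.1, np.random.default_rng(0))
```

Because only one coordinate is updated per step, `x1` differs from the starting point in exactly one entry. The trace of the covariance equals \( (d - 1)\|\nabla F\|^2 \) under the same assumptions.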
### Specific formulas:
- **General form of the modified SDE**:
\[
d\tilde{X} = \left( -\nabla F(\tilde{X}) + h F_1(\tilde{X}) \right) dt + \sqrt{h} G_1(\tilde{X}) dW
\]
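where \( \hat{\nabla} F \) denotes the stochastic estimate of \( \nabla F \) used by the method, and \( F_1 \) and \( G_1 \) are given by: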
\[
F_1 = -\frac{1}{2} (\nabla \nabla F) \nabla F = -\frac{1}{4} \nabla \|\nabla F\|^2
\]
\[
G_1 = \sqrt{E[(\hat{\nabla} F - \nabla F)(\hat{\nabla} F - \nabla F)^T]}
\]
- **Covariance matrix \( \Sigma \) appearing in the modified equation of the stochastic coordinate descent method**:
\[
\Sigma(\tilde{X}) = d \sum_{i = 1}^d U_i (\nabla F(\tilde{X})) (\nabla F(\tilde{X}))^T U_i^T - (\nabla F(\tilde{X})) (\nabla F(\tilde{X}))^T
\]
- **Mean-square stability conditions**:
\[
E[\|\tilde{X}(t) - X^\star\|^2] \leq e^{-\alpha t} \|\tilde{X}(0) - X^\star\|^2
\]
where \(\alpha = 2\mu + hK - h(d - 1)L^2\), \(X^\star\) is the unique minimum point of \(F\), and the constants \(\mu\), \(K\) and \(L\) come from the paper's assumptions on \(F\). If \((d - 1)L^2 > K\) and the step size satisfies \(h < \frac{2\mu}{(d - 1)L^2 - K}\), then \(\alpha > 0\) and:
\[
\lim_{t \to \infty} E[\|\tilde{X}(t) - X^\star\|^2] = 0
\]
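To see how these formulas fit together, the following is a rough Euler-Maruyama sketch that simulates the modified SDE for a toy problem and estimates the mean-square distance to the minimizer by Monte Carlo. Everything specific here is an assumption made for illustration: the objective \( F(x) = \tfrac{1}{2} x^T A x \) with \( A = \mathrm{diag}(1, 2, 3) \) (so \( X^\star = 0 \)), the choice \( U_i = e_i e_i^T \), the step size `h`, the time step `dt`, and the helper names `sigma_scd`, `psd_sqrt` and `simulate_modified_sde`. It does not use the constants \( \mu \), \( K \), \( L \) and is not the paper's experiment.

```python
import numpy as np

def sigma_scd(g):
    """Sigma = d * sum_i U_i g g^T U_i^T - g g^T for a batch of gradients g (n, d),
    assuming U_i = e_i e_i^T, so the first term is d * diag(g^2)."""
    d = g.shape[-1]
    diag_part = d * np.einsum("ni,ij->nij", g**2, np.eye(d))
    return diag_part - np.einsum("ni,nj->nij", g, g)

def psd_sqrt(S):
    """Symmetric square root of a batch of positive semidefinite matrices."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)  # remove tiny negative eigenvalues from round-off
    return np.einsum("nik,nk,njk->nij", V, np.sqrt(w), V)

def simulate_modified_sde(lam, h, x0, T, dt, n_paths, seed=0):
    """Euler-Maruyama for dX = (-grad F + h F_1) dt + sqrt(h) G_1 dW,
    with F(x) = 0.5 * sum_j lam_j * x_j^2 (an assumed toy objective)."""
    rng = np.random.default_rng(seed)
    d = lam.size
    X = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    for _ in range(int(round(T / dt))):
        g = lam * X                      # grad F(X) = A X
        F1 = -0.5 * lam * g              # -(1/2) (Hessian F) grad F
        G1 = psd_sqrt(sigma_scd(g))      # G_1 = sqrt(Sigma)
        dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, d))
        X = X + (-g + h * F1) * dt + np.sqrt(h) * np.einsum("nij,nj->ni", G1, dW)
    return X

lam = np.array([1.0, 2.0, 3.0])
x0 = np.ones(3)
n_paths = 500
X_T = simulate_modified_sde(lam, h=0.1, x0=x0, T=1.0, dt=0.01, n_paths=n_paths)

ms_0 = float(np.sum(x0**2))                     # squared distance to X* = 0 at t = 0
ms_T = float(np.mean(np.sum(X_T**2, axis=1)))   # Monte Carlo estimate at t = T
print("mean-square distance at t=0 and t=T:", ms_0, ms_T)
```

For this small step size the estimated mean-square distance decays over time. Increasing `h` in the sketch is one way to explore how the noise term, which scales with \( (d - 1) \), competes with the contraction from the drift.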
### Summary:
This paper provides a new perspective for analyzing the dynamics of stochastic optimization algorithms by applying the method of modified equations. For the stochastic coordinate descent method in particular, it gives conditions that ensure stable convergence. These results help to understand the behavior of existing algorithms.