Composing Optimized Stepsize Schedules for Gradient Descent

Benjamin Grimmer, Kevin Shu, Alex L. Wang
2024-10-22
Abstract: Recent works by Altschuler and Parrilo and the authors have shown that it is possible to accelerate the convergence of gradient descent on smooth convex functions, even without momentum, just by picking special stepsizes. In this paper, we provide a general theory for composing stepsize schedules capturing all recent advances in this area and more. We propose three notions of "composable" stepsize schedules with elementary associated composition operations for combining them. From these operations, in addition to recovering recent works, we construct three highly optimized sequences of stepsize schedules. We first construct optimized stepsize schedules of every length generalizing the exponentially spaced silver stepsizes. We then construct highly optimized stepsize schedules for minimizing final objective gap or gradient norm, improving on prior rates by constants and, more importantly, matching or beating the numerically computed minimax optimal schedules. We conjecture these schedules are in fact minimax (information theoretic) optimal. Several novel tertiary results follow from our theory including recovery of the recent dynamic gradient norm minimizing short stepsizes and extending them to objective gap minimization.
Optimization and Control
What problem does this paper attempt to address?
This paper addresses how to accelerate the convergence of gradient descent on smooth convex functions through carefully designed stepsize schedules. Specifically, the authors propose a general theory for composing stepsize schedules that captures the recent progress in this area, built on three notions of "composable" stepsize schedules and their associated composition operations. Using these operations, the authors construct three highly optimized families of stepsize schedules. Two of these families target the final objective gap or the final gradient norm; they improve on prior convergence rates by constants and match or beat the numerically computed minimax optimal schedules.

### Main contributions:

1. **Propose composable stepsize schedules**: Define three types of composable stepsize schedules (f-composable, g-composable, and s-composable), and introduce the corresponding composition operations (f-join, g-join, and s-join).
2. **Recover and analyze existing work**: Using these composition operations, the authors recover and analyze multiple stepsize patterns from the recent literature, including the fractal stepsize patterns proposed by Altschuler and Parrilo [2] and Grimmer et al. [5], 25 numerically optimal stepsize patterns calculated by Gupta et al. [7], and some other dynamic stepsize patterns.
3. **Calculate the optimal basic stepsize schedules**: Using dynamic programming, the authors show how to compute the Optimal Basic Schedules (OBS) of any length. In particular, the OBS-S family of stepsize schedules generalizes the silver stepsize schedule defined in [2] to all lengths \(n\).
4. **Optimality conjecture**: The authors conjecture that the OBS-F schedules attain the optimal worst-case convergence rate for the final objective gap, and that the OBS-G schedules are likewise optimal for minimizing the final gradient norm.
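As a minimal sketch of what a stepsize schedule means in practice, the Python below runs gradient descent with a separate stepsize \(h_k\) at each iteration on a small quadratic. The silver stepsizes are generated assuming the form \(h_i = 1 + \rho^{\nu(i)-1}\) with \(\rho = 1+\sqrt{2}\), recalled from Altschuler and Parrilo [2] rather than stated in this summary. The test problem, `CURVATURES`, and all function names are hypothetical.

```python
import math


def silver_schedule(n):
    """First n silver stepsizes, ASSUMING h_i = 1 + rho**(nu(i) - 1) for i = 1..n,
    where rho = 1 + sqrt(2) and 2**nu(i) is the largest power of 2 dividing i."""
    rho = 1.0 + math.sqrt(2.0)
    schedule = []
    for i in range(1, n + 1):
        nu = (i & -i).bit_length() - 1  # 2-adic valuation of i
        schedule.append(1.0 + rho ** (nu - 1))
    return schedule


def gradient_descent(grad, x0, schedule, smoothness=1.0):
    """Iterate x_{k+1} = x_k - (h_k / L) * grad(x_k) for each h_k in the schedule."""
    x = list(x0)
    for h in schedule:
        g = grad(x)
        x = [xi - (h / smoothness) * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical test problem: f(x) = 0.5 * sum(c_j * x_j**2).
# It is convex and 1-smooth because the largest curvature is 1; its minimum value is 0.
CURVATURES = [1.0, 0.5, 0.1]


def objective(x):
    return 0.5 * sum(c * xj * xj for c, xj in zip(CURVATURES, x))


def gradient(x):
    return [c * xj for c, xj in zip(CURVATURES, x)]


x0 = [1.0, 1.0, 1.0]
n = 7
gap_constant = objective(gradient_descent(gradient, x0, [1.0] * n))
gap_silver = objective(gradient_descent(gradient, x0, silver_schedule(n)))
print(f"constant h = 1: final gap {gap_constant:.6f}")
print(f"silver schedule: final gap {gap_silver:.6f}")
```

A single instance like this only illustrates the mechanics. The rates discussed in the paper are worst-case guarantees over all smooth convex functions, so one quadratic neither confirms nor refutes them.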
### Formula summary:

- **Definition of f-composable**:
  \[ f(x_n)-f(x^*) \leq \eta\,\frac{1}{2}\left\|x_0 - x^*\right\|^2 \]
  where
  \[ \eta=\frac{1}{1 + 2\sum_{i = 0}^{n-1}h_i}=\prod_{i = 0}^{n-1}(h_i - 1)^2 \]
- **Definition of g-composable**:
  \[ \frac{1}{2}\left\|\nabla f(x_n)\right\|^2 \leq \eta\left(f(x_0)-f(x^*)\right) \]
  where
  \[ \eta=\frac{1}{1 + 2\sum_{i = 0}^{n-1}h_i}=\prod_{i = 0}^{n-1}(h_i - 1)^2 \]
- **Definition of s-composable**:
  \[ \left(1-\eta\right)\frac{1}{2}\left\|\nabla f(x_n)\right\|^2+\eta^2\frac{1}{2}\left\|x_n - x^*\right\|^2+(\eta-\eta^2)\left(f(x_n)-f(x^*)\right) \leq \eta^2\frac{1}{2}\left\|x_0 - x^*\right\|^2 \]
  where
  \[ \eta=\frac{1}{1+\sum_{i = 0}^{n-1}h_i}=\prod_{i = 0}^{n-1}(h_i - 1) \]

### Conclusion:

This paper provides a systematic method for designing and analyzing stepsize schedules in the gradient descent algorithm by introducing composable stepsize schedules and their combination operations, thereby significantly improving...
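Each definition in the formula summary gives \(\eta\) in two forms, one from the sum of the stepsizes and one from a product, so a schedule can meet a definition as written only if the two forms agree. The sketch below checks that agreement numerically. The example schedules (`[3/2]`, `[√2]`, `[√2, 2, √2]`) and the helper names are illustrative choices that do not come from this summary, and agreement of the two forms does not by itself establish the corresponding inequality.

```python
import math


def rate_from_sum(schedule, factor):
    """eta = 1 / (1 + factor * sum(h)); factor is 2 for f/g and 1 for s."""
    return 1.0 / (1.0 + factor * sum(schedule))


def rate_from_product(schedule, power):
    """eta = prod((h - 1)**power); power is 2 for f/g and 1 for s."""
    return math.prod((h - 1.0) ** power for h in schedule)


def forms_agree(schedule, kind):
    """True if both expressions for eta coincide (kind is 'f', 'g', or 's')."""
    k = 1 if kind == "s" else 2
    return math.isclose(rate_from_sum(schedule, k), rate_from_product(schedule, k))


sqrt2 = math.sqrt(2.0)

# Single step h = 3/2: 1 / (1 + 2 * 1.5) = 1/4 and (1.5 - 1)**2 = 1/4.
print("f/g forms agree for [3/2]:", forms_agree([1.5], "f"))

# Single step h = sqrt(2): 1 / (1 + sqrt(2)) = sqrt(2) - 1.
print("s forms agree for [sqrt2]:", forms_agree([sqrt2], "s"))

# Three steps [sqrt(2), 2, sqrt(2)]: 1 / (3 + 2*sqrt(2)) = (sqrt(2) - 1)**2.
print("s forms agree for [sqrt2, 2, sqrt2]:", forms_agree([sqrt2, 2.0, sqrt2], "s"))

# A generic schedule fails the check: constant unit steps give a zero product.
print("f/g forms agree for [1, 1]:", forms_agree([1.0, 1.0], "f"))
```

The last case shows how restrictive the equality is: most schedules do not satisfy it, which is consistent with the summary treating composability as a special property rather than a generic one.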