Abstract: We study the use of gradient descent with backtracking line search (GD-BLS) to solve the noisy optimization problem $\theta_\star:=\mathrm{argmin}_{\theta\in\mathbb{R}^d} \mathbb{E}[f(\theta,Z)]$, assuming that the function $F(\theta):=\mathbb{E}[f(\theta,Z)]$ is strictly convex but not necessarily $L$-smooth. Assuming that $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^2]<\infty$, we first prove that sample average approximation based on GD-BLS allows us to estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-0.25})$, where $B$ is the available computational budget. We then show that this rate can be improved by stopping the optimization process early, as soon as the gradient of the objective function is sufficiently close to zero, and using the residual computational budget to optimize, again with GD-BLS, a finer approximation of $F$. By iteratively applying this strategy $J$ times, we establish that we can estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{1}{2}(1-\delta^{J})})$, where $\delta\in(1/2,1)$ is a user-specified parameter. More generally, we show that if $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^{1+\alpha}]<\infty$ for some known $\alpha\in (0,1]$, then this approach, which can be seen as a retrospective approximation algorithm with a fixed computational budget, allows us to learn $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{\alpha}{1+\alpha}(1-\delta^{J})})$, where $\delta\in(2\alpha/(1+3\alpha),1)$ is a tuning parameter. Beyond knowing $\alpha$, achieving the aforementioned convergence rates does not require tuning the algorithm's parameters to the specific functions $F$ and $f$ at hand, and we exhibit a simple noisy optimization problem for which stochastic gradient is not guaranteed to converge while the algorithms discussed in this work are.
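To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It runs gradient descent with Armijo backtracking on a sample average approximation of $F$, and then a staged variant that stops early on coarse approximations and warm-starts finer ones. Everything specific here is an assumption made for illustration: the one-dimensional loss $f(\theta,z)=\theta^4/4+(\theta-z)^2/2$ (strictly convex, with a gradient that is not globally Lipschitz), the Gaussian noise, the line-search constants, and the schedule of sample sizes and gradient tolerances. The abstract does not specify how the budget $B$, $\delta$ or $J$ determine such a schedule, so the sketch does not attempt to reproduce that.

```python
import random

def f(theta, z):
    # Hypothetical loss: strictly convex in theta, but its gradient is not
    # globally Lipschitz because of the quartic term.
    return theta**4 / 4 + (theta - z) ** 2 / 2

def grad_f(theta, z):
    return theta**3 + (theta - z)

def make_saa(zs):
    """Sample average approximation of F and of its gradient, built from draws zs."""
    n = len(zs)
    F_n = lambda theta: sum(f(theta, z) for z in zs) / n
    G_n = lambda theta: sum(grad_f(theta, z) for z in zs) / n
    return F_n, G_n

def gd_bls(F, G, theta, tol, max_iters=200, t0=1.0, beta=0.5, c=1e-4):
    """Gradient descent with Armijo backtracking; stops once |gradient| <= tol."""
    for _ in range(max_iters):
        g = G(theta)
        if abs(g) <= tol:
            break
        t, F_cur = t0, F(theta)
        # Shrink the step until the sufficient-decrease (Armijo) condition holds.
        # The lower bound on t is only a numerical safeguard.
        while F(theta - t * g) > F_cur - c * t * g * g and t > 1e-12:
            t *= beta
        theta -= t * g
    return theta

random.seed(0)
draws = [random.gauss(2.0, 1.0) for _ in range(2000)]  # Z ~ N(2, 1), an assumption

# Single-stage sample average approximation: optimize on all draws at once.
F_n, G_n = make_saa(draws)
theta_saa = gd_bls(F_n, G_n, theta=5.0, tol=1e-6)

# Staged variant: stop early on coarse approximations, then warm-start a finer one.
# The (sample size, tolerance) schedule below is hypothetical.
theta_staged = 5.0
for n, tol in [(100, 1e-1), (500, 1e-2), (2000, 1e-6)]:
    F_k, G_k = make_saa(draws[:n])
    theta_staged = gd_bls(F_k, G_k, theta_staged, tol)

print(theta_saa, theta_staged)
```

With $\mathbb{E}[Z]=2$, the population minimizer of this toy problem solves $\theta^3+\theta=2$, so $\theta_\star=1$, and both estimates should land close to it. The staged run spends its early iterations on cheap approximations built from few samples, which is the intuition behind reusing the residual budget on a finer approximation of $F$.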

Improving the Convergence Rates of Forward Gradient Descent with Repeated Sampling

Convergence guarantees for forward gradient descent in the linear regression model

LSH-sampling Breaks the Computation Chicken-and-egg Loop in Adaptive Stochastic Gradient Estimation

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Carathéodory Sampling for Stochastic Gradient Descent

Stochastic Gradient Variance Reduction by Solving a Filtering Problem

Gradient Descent for Noisy Optimization

Asynchronous Accelerated Stochastic Gradient Descent

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Provably Faster Gradient Descent via Long Steps

Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians

Beyond the Edge of Stability via Two-step Gradient Updates

Demystifying SGD with Doubly Stochastic Gradients

Novel Convergence Results of Adaptive Stochastic Gradient Descents

On Faster Convergence of Scaled Sign Gradient Descent

Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent