Abstract:We study the use of gradient descent with backtracking line search (GD-BLS) to solve the noisy optimization problem $\theta_\star:=\mathrm{argmin}_{\theta\in\mathbb{R}^d} \mathbb{E}[f(\theta,Z)]$, imposing that the function $F(\theta):=\mathbb{E}[f(\theta,Z)]$ is strictly convex but not necessarily $L$-smooth. Assuming that $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^2]<\infty$, we first prove that sample average approximation based on GD-BLS allows to estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-0.25})$, where $B$ is the available computational budget. We then show that we can improve upon this rate by stopping the optimization process earlier when the gradient of the objective function is sufficiently close to zero, and use the residual computational budget to optimize, again with GD-BLS, a finer approximation of $F$. By iteratively applying this strategy $J$ times, we establish that we can estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{1}{2}(1-\delta^{J})})$, where $\delta\in(1/2,1)$ is a user-specified parameter. More generally, we show that if $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^{1+\alpha}]<\infty$ for some known $\alpha\in (0,1]$ then this approach, which can be seen as a retrospective approximation algorithm with a fixed computational budget, allows to learn $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{\alpha}{1+\alpha}(1-\delta^{J})})$, where $\delta\in(2\alpha/(1+3\alpha),1)$ is a tuning parameter. Beyond knowing $\alpha$, achieving the aforementioned convergence rates do not require to tune the algorithms parameters according to the specific functions $F$ and $f$ at hand, and we exhibit a simple noisy optimization problem for which stochastic gradient is not guaranteed to converge while the algorithms discussed in this work are.

Greedy coordinate descent from the view of $\ell_1$-norm gradient descent

Accelerated Stochastic Greedy Coordinate Descent by Soft Thresholding Projection Onto Simplex

Scaling Up Coordinate Descent Algorithms for Large $\ell_1$ Regularization Problems

Efficiency of Coordinate Descent Methods For Structured Nonconvex Optimization

Greedy Double Subspaces Coordinate Descent Method for Solving Linear Least-Squares Problems

A Cyclic Coordinate Descent Method for Convex Optimization on Polytopes

Block Coordinate Descent Methods for Structured Nonconvex Optimization with Nonseparable Constraints: Optimality Conditions and Global Convergence

Analyzing and Improving Greedy 2-Coordinate Updates for Equality-Constrained Optimization via Steepest Descent in the 1-Norm

A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems

Feature Clustering for Accelerating Parallel Coordinate Descent

Universal Gradient Descent Ascent Method for Nonconvex-Nonconcave Minimax Optimization

Shuffling Gradient Descent-Ascent with Variance Reduction for Nonconvex-Strongly Concave Smooth Minimax Problems

Gradient Descent for Noisy Optimization

Greedy Block Coordinate Descent for Large Scale Gaussian Process Regression

Coordinate Descent with Arbitrary Sampling I: Algorithms and Complexity

Randomness and Permutations in Coordinate Descent Methods

AGDA+: Proximal Alternating Gradient Descent Ascent Method With a Nonmonotone Adaptive Step-Size Search For Nonconvex Minimax Problems

Parallel Coordinate Descent Newton for Large-scale L1-Regularized Minimization.

On the Suboptimality of Proximal Gradient Descent for $\ell^{0}$ Sparse Approximation

A Stochastic GDA Method With Backtracking For Solving Nonconvex (Strongly) Concave Minimax Problems

A Flexible Coordinate Descent Method