Abstract: Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. This $\texttt{SPS}_+$ is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that $\texttt{SPS}_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of $\texttt{FUVAL}$: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of $\texttt{FUVAL}$ and experimental results. One shortcoming of our work is that the convergence analysis of $\texttt{FUVAL}$ shows no advantage over SGD. Another shortcoming is that currently only the full-batch version of $\texttt{FUVAL}$ shows a minor advantage over GD (gradient descent) in terms of sensitivity to the step size. The stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make $\texttt{FUVAL}$ competitive. Currently the new $\texttt{FUVAL}$ method studied in this paper does not offer any clear theoretical or practical advantage. We have chosen to make this draft available online nonetheless because of some of the analysis techniques we use, such as the non-smooth analysis of $\texttt{SPS}_+$, and also to show an apparently interesting approach that currently does not work.
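
To make the $\texttt{SPS}_+$ step described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the update $x \leftarrow x - \frac{\max(f_i(x) - f_i(x^*),\, 0)}{\|g\|^2}\, g$, where $g$ is a (sub)gradient of the sampled loss $f_i$ at $x$ and $f_i(x^*)$ is the sampled loss at optimality, taken as known. The function names, the absolute-value loss, and the toy data are all hypothetical choices made for illustration; the toy problem is constructed so that every $f_i(x^*) = 0$.

```python
import random


def sps_plus_step(x, loss, grad, loss_at_opt):
    """One assumed SPS_+ update: x - max(loss - loss_at_opt, 0) / ||grad||^2 * grad."""
    grad_sq = sum(g * g for g in grad)
    if grad_sq == 0.0:
        return list(x)  # zero (sub)gradient: leave the iterate unchanged
    # The positive part keeps the step size non-negative.
    step = max(loss - loss_at_opt, 0.0) / grad_sq
    return [xi - step * gi for xi, gi in zip(x, grad)]


def abs_loss_and_subgrad(x, a, b):
    """Non-smooth sampled loss f_i(x) = |<a, x> - b| and one of its subgradients."""
    r = sum(ai * xi for ai, xi in zip(a, x)) - b
    sign = (r > 0) - (r < 0)
    return abs(r), [sign * ai for ai in a]


def run_sps_plus(data, x0, steps, seed=0):
    """Sample one term per iteration and apply the SPS_+ step (toy setting)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        a, b = rng.choice(data)
        loss, grad = abs_loss_and_subgrad(x, a, b)
        # Hypothetical setup: the data is consistent, so f_i(x*) = 0 for all i.
        x = sps_plus_step(x, loss, grad, loss_at_opt=0.0)
    return x


# Toy consistent system with solution x* = [1, -2].
toy_data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], -1.0)]
x_final = run_sps_plus(toy_data, x0=[5.0, 5.0], steps=200)
```

In this toy setting each step projects the iterate onto the set where the sampled loss vanishes, so the distance to $x^*$ never increases. When the sampled loss is already below its optimal value, the positive part yields a zero step, which is the only difference from a plain Polyak-type step in this sketch.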

Adaptive Proximal SGD Based on New Estimating Sequences for Sparser ERM

Making SGD Efficient by Reducing Projections: Guaranteed Optimal Rate for Strongly Convex Optimization

A Bregman Proximal Stochastic Gradient Method with Extrapolation for Nonconvex Nonsmooth Problems

Asynchronous Stochastic Proximal Optimization Algorithms with Variance Reduction

Asynchronous Stochastic Proximal Methods for Nonconvex Nonsmooth Optimization.

Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models

Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

A Minibatch Proximal Stochastic Recursive Gradient Algorithm Using a Trust-Region-Like Scheme and Barzilai–Borwein Stepsizes

Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions

A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization.

Asynchronous Accelerated Stochastic Gradient Descent.

Accelerated gradient methods for sparse statistical learning with nonconvex penalties

Self-guided Evolution Strategies with Historical Estimated Gradients

Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

AGDA+: Proximal Alternating Gradient Descent Ascent Method With a Nonmonotone Adaptive Step-Size Search For Nonconvex Minimax Problems

Adaptive smoothing mini-batch stochastic accelerated gradient method for nonsmooth convex stochastic composite optimization

Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent

Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM

A class of modified accelerated proximal gradient methods for nonsmooth and nonconvex minimization problems

A Hybrid Stochastic-Deterministic Minibatch Proximal Gradient Method for Efficient Optimization and Generalization.