Abstract: This paper studies the asymptotic behavior of constant step size Stochastic Gradient Descent for the minimization of an unknown function, defined as the expectation of a nonconvex, nonsmooth, locally Lipschitz random function. As the gradient may not exist, it is replaced by a certain operator: a reasonable choice is to use an element of the Clarke subdifferential of the random function; another choice is the output of the celebrated backpropagation algorithm, which is popular amongst practitioners, and whose properties have recently been studied by Bolte and Pauwels. Since the expectation of the chosen operator is not in general an element of the Clarke subdifferential of the mean function, it has been assumed in the literature that an oracle of the Clarke subdifferential of the mean function is available. As a first result, it is shown in this paper that such an oracle is not needed for almost all initialization points of the algorithm. Next, in the small step size regime, it is shown that the interpolated trajectory of the algorithm converges in probability (in the compact convergence sense) towards the set of solutions of a particular differential inclusion: the subgradient flow. Finally, viewing the iterates as a Markov chain whose transition kernel is indexed by the step size, it is shown that the invariant distributions of the kernel converge weakly to the set of invariant distributions of this differential inclusion as the step size tends to zero. These results show that when the step size is small, with high probability, the iterates eventually lie in a neighborhood of the critical points of the mean function.
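The last claim of the abstract can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's own experiment: the toy random function f(x, ξ) = ||x| - 1| + ξx with zero-mean Gaussian ξ, the noise level, and all function names are hypothetical choices made for illustration. Its mean function is F(x) = ||x| - 1|, which is nonconvex and nonsmooth, with Clarke critical points -1, 0 and 1. The sketch runs constant step SGD with a selection of the Clarke subdifferential and measures how far the late iterates are from that critical set.

```python
import random


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def clarke_selection(x, xi):
    """One element of the Clarke subdifferential of f(., xi) at x,
    for the assumed toy function f(x, xi) = | |x| - 1 | + xi * x.
    At the kinks (x = 0 and |x| = 1) the first term is 0, which lies
    in the Clarke subdifferential of | |x| - 1 | at those points."""
    return sign(abs(x) - 1.0) * sign(x) + xi


def constant_step_sgd(x0, step, n_iter, noise_std=0.5, seed=0):
    """Iterate x_{k+1} = x_k - step * g(x_k, xi_{k+1}) and return the path."""
    rng = random.Random(seed)
    x = float(x0)
    path = []
    for _ in range(n_iter):
        xi = rng.gauss(0.0, noise_std)  # zero-mean noise (illustrative choice)
        x -= step * clarke_selection(x, xi)
        path.append(x)
    return path


def mean_distance_to_critical_set(path, tail=1000):
    """Average distance of the last `tail` iterates to {-1, 0, 1},
    the Clarke critical points of the mean function F(x) = | |x| - 1 |."""
    critical = (-1.0, 0.0, 1.0)
    late = path[-tail:]
    return sum(min(abs(x - c) for c in critical) for x in late) / len(late)


# Compare a few step sizes from the same (arbitrary) initialization.
for step in (0.1, 0.01, 0.001):
    path = constant_step_sgd(x0=3.0, step=step, n_iter=20000)
    print(step, mean_distance_to_critical_set(path))
```

In this toy setting the late iterates keep fluctuating, as expected for a constant step size, but the size of the fluctuation around the critical set shrinks with the step. This is only a one-dimensional illustration of the qualitative statement and says nothing about the rates or the generality of the paper's results.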

Provably Faster Gradient Descent via Long Steps

Accelerated Gradient Descent via Long Steps

Accelerated Objective Gap and Gradient Norm Convergence for Gradient Descent via Long Steps

Anytime Acceleration of Gradient Descent

Accelerated Gradient Descent by Concatenation of Stepsize Schedules

Accelerating Proximal Gradient Descent via Silver Stepsizes

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

A Strengthened Conjecture on the Minimax Optimal Constant Stepsize for Gradient Descent

Composing Optimized Stepsize Schedules for Gradient Descent

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

Toward a Unified Theory of Gradient Descent under Generalized Smoothness

New Gradient Methods with Adaptive Stepsizes by Approximate Models

New Stepsizes for the Gradient Method.

New stepsizes for the gradient method

Exact worst-case convergence rates of gradient descent: a complete analysis for all constant stepsizes over nonconvex and convex functions

Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates

Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under $(L_0,L_1)$-Smoothness

Open Problem: Anytime Convergence Rate of Gradient Descent

Directional Smoothness and Gradient Methods: Convergence and Adaptivity