Abstract: Although the stochastic gradient descent (SGD) method and its variants (e.g., stochastic momentum methods, AdaGrad) are the algorithms of choice for solving non-convex problems (especially in deep learning), big gaps remain between theory and practice, with many questions unresolved. For example, there is still no convergence theory for SGD and its variants that use a stagewise step size and return an averaged solution, as is done in practice. In addition, theoretical insight into why the adaptive step size of AdaGrad could improve on the non-adaptive step size of SGD is still missing for non-convex optimization. This paper aims to address these questions and fill the gap between theory and practice. We propose a universal stagewise optimization framework for a broad family of non-smooth non-convex (namely, weakly convex) problems with the following key features: (i) at each stage, any suitable stochastic convex optimization algorithm (e.g., SGD or AdaGrad) that returns an averaged solution can be employed to minimize a regularized convex problem; (ii) the step size is decreased in a stagewise manner; (iii) the final solution is selected from all stagewise averaged solutions, with sampling probabilities that increase with the stage number. Our theoretical results for stagewise AdaGrad exhibit its adaptive convergence and therefore shed light on why it converges faster than stagewise SGD on problems with sparse stochastic gradients. To the best of our knowledge, these results are the first of their kind to address the unresolved issues of existing theories mentioned above. Besides the theoretical contributions, our empirical studies show that our stagewise SGD and AdaGrad improve the generalization performance of existing variants/implementations of SGD and AdaGrad.
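
The three features listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions not stated in the abstract: the regularizer is taken to be a quadratic proximal term (x - x_ref)^2 / (2 * gamma), the step size is eta0 / s at stage s, and stage s is sampled with probability proportional to s**alpha. The function names, the toy objective phi(x) = |x^2 - 1| (a one-dimensional weakly convex, non-smooth function), and all parameter values are hypothetical.

```python
import random


def stagewise_sgd(subgrad, x0, num_stages=10, iters_per_stage=200,
                  eta0=0.1, gamma=0.25, alpha=1.0, seed=0):
    """Hypothetical stagewise SGD sketch for a 1-D weakly convex objective."""
    rng = random.Random(seed)
    x_ref = x0
    stage_averages = []
    for s in range(1, num_stages + 1):
        eta = eta0 / s  # assumed schedule: step size shrinks once per stage
        x, running_sum = x_ref, 0.0
        for _ in range(iters_per_stage):
            # Stochastic subgradient of the regularized problem
            # phi(x) + (x - x_ref)^2 / (2 * gamma), assumed convex for small gamma.
            g = subgrad(x, rng) + (x - x_ref) / gamma
            x -= eta * g
            running_sum += x
        # The averaged iterate of this stage becomes the next reference point.
        x_ref = running_sum / iters_per_stage
        stage_averages.append(x_ref)
    # Assumed sampling rule: P(stage s) proportional to s**alpha, increasing in s.
    weights = [s ** alpha for s in range(1, num_stages + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    x_out = rng.choices(stage_averages, weights=probs, k=1)[0]
    return x_out, stage_averages, probs


def phi(x):
    """Toy objective: weakly convex and non-smooth, minimized at x = +/-1."""
    return abs(x * x - 1.0)


def noisy_subgrad(x, rng):
    """A subgradient of phi at x plus Gaussian noise (noise level is made up)."""
    sign = 1.0 if x * x >= 1.0 else -1.0
    return sign * 2.0 * x + rng.gauss(0.0, 0.1)


x_out, stage_averages, probs = stagewise_sgd(noisy_subgrad, x0=2.0)
print("sampled solution:", x_out, "objective:", phi(x_out))
```

Swapping the inner loop for another stochastic convex solver that returns an averaged iterate (for example an AdaGrad-style update) leaves the outer stagewise structure unchanged, which is the sense in which the abstract calls the framework universal.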

A new non-convex framework to improve asymptotical knowledge on generic stochastic gradient descent

A stochastic use of the Kurdyka-Lojasiewicz property: Investigation of optimization algorithms behaviours in a non-convex differentiable framework

A new use of the Kurdyka-Lojasiewicz property to study asymptotic behaviours of some stochastic optimization algorithms in a non-convex differentiable framework

A KL-based Analysis Framework with Applications to Non-Descent Optimization Methods

Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

Almost Sure Convergence Rates Analysis and Saddle Avoidance of Stochastic Gradient Methods

Stochastic Gradient Descent Revisited

High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails

Stochastic Gradient Descent in the Viewpoint of Graduated Optimization

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

On Almost Sure Convergence Rates of Stochastic Gradient Methods

Demystifying the Myths and Legends of Nonconvex Convergence of SGD

Decentralized Stochastic Subgradient Methods for Nonsmooth Nonconvex Optimization

Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions

Stability and convergence analysis of AdaGrad for non-convex optimization via novel stopping time-based techniques

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

On Convergence of Incremental Gradient for Non-Convex Smooth Functions

Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm