Abstract: Although the stochastic gradient descent (SGD) method and its variants (e.g., stochastic momentum methods, AdaGrad) are the algorithms of choice for solving non-convex problems (especially in deep learning), big gaps remain between theory and practice, with many questions unresolved. For example, convergence theory is still lacking for SGD and its variants that use a stagewise step size and return an averaged solution, as is common in practice. In addition, theoretical insight into why the adaptive step size of AdaGrad could improve on the non-adaptive step size of SGD is still missing for non-convex optimization. This paper aims to address these questions and fill the gap between theory and practice. We propose a universal stagewise optimization framework for a broad family of non-smooth non-convex (namely, weakly convex) problems with the following key features: (i) at each stage, any suitable stochastic convex optimization algorithm (e.g., SGD or AdaGrad) that returns an averaged solution can be employed to minimize a regularized convex problem; (ii) the step size is decreased in a stagewise manner; (iii) the final solution is an averaged solution selected from all stagewise averaged solutions, with sampling probabilities that increase with the stage number. Our theoretical results for stagewise AdaGrad exhibit its adaptive convergence and thus shed light on why it converges faster than stagewise SGD on problems with sparse stochastic gradients. To the best of our knowledge, these new results are the first of their kind to address the unresolved issues of existing theories mentioned above. Besides the theoretical contributions, our empirical studies show that our stagewise SGD and AdaGrad improve the generalization performance of existing variants/implementations of SGD and AdaGrad.
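The three features listed in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or code: the toy objective f(x) = |x^2 - 1|, the schedules `eta0 / s` and `iters0 * s`, the weights `s ** alpha`, and every function and parameter name below are assumptions chosen only for illustration.

```python
import random


def subgrad(x, rng, noise=0.1):
    """Noisy subgradient of the toy objective f(x) = |x^2 - 1|.

    f is non-smooth and non-convex but weakly convex: f(x) + x^2 is convex.
    """
    g = 2.0 * x if x * x >= 1.0 else -2.0 * x
    return g + rng.gauss(0.0, noise)


def stagewise_sgd(x0, stages=10, iters0=50, eta0=0.1, gamma=0.25,
                  alpha=1.0, seed=0):
    """Hypothetical stagewise SGD sketch; all schedules are assumptions."""
    rng = random.Random(seed)
    x_ref = x0
    stage_avgs = []
    for s in range(1, stages + 1):
        eta = eta0 / s        # (ii) step size decreased stage by stage
        iters = iters0 * s    # assumed: later stages run longer
        x, total = x_ref, 0.0
        for _ in range(iters):
            # (i) SGD on the regularized stage problem
            #     f(x) + (x - x_ref)^2 / (2 * gamma),
            # which is convex here because 1 / gamma = 4 exceeds the
            # weak-convexity constant 2 of the toy objective.
            g = subgrad(x, rng) + (x - x_ref) / gamma
            x -= eta * g
            total += x
        x_ref = total / iters  # averaged solution of this stage
        stage_avgs.append(x_ref)
    # (iii) sample the final solution from the stage averages with
    # probabilities that increase with the stage number.
    weights = [s ** alpha for s in range(1, stages + 1)]
    final = rng.choices(stage_avgs, weights=weights, k=1)[0]
    return final, stage_avgs


if __name__ == "__main__":
    final, stage_avgs = stagewise_sgd(x0=3.0)
    print("stage averages:", [round(v, 3) for v in stage_avgs])
    print("sampled final solution:", round(final, 3))
```

Starting from `x0=3.0`, the stage averages in this toy problem drift toward the minimizer at x = 1. Because the final solution is sampled, it can be an early-stage average, though later stages are more likely to be chosen.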

Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search

Iteration and stochastic first-order oracle complexities of stochastic gradient descent using constant and decaying learning rates

Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent

The Number of Steps Needed for Nonconvex Optimization of a Deep Learning Optimizer is a Rational Function of Batch Size

The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Unlocking optimal batch size schedules using continuous-time control and perturbation theory

Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling

Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions

Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

Why Does Large Batch Training Result in Poor Generalization? A Comprehensive Explanation and a Better Strategy from the Viewpoint of Stochastic Optimization

On the Diffusion Approximation of Nonconvex Stochastic Gradient Descent

Stagewise Enlargement of Batch Size for SGD-based Learning

Exact Mean Square Linear Stability Analysis for SGD

On the Convergence and Improvement of Stochastic Normalized Gradient Descent

Efficient mini-batch training for stochastic optimization

Mini-batch Algorithms with Barzilai-Borwein Update Step

ASGD: Stochastic Gradient Descent with Adaptive Batch Size for Every Parameter