Stagewise Accelerated Stochastic Gradient Methods for Nonconvex Optimization

Cui Jia, Zhuoxu Cui
DOI: https://doi.org/10.3390/math12111664
IF: 2.4
2024-05-27
Mathematics
Abstract: For large-scale optimization that covers a wide range of optimization problems encountered frequently in machine learning and deep neural networks, stochastic optimization has become one of the most used methods thanks to its low computational complexity. In machine learning and deep learning problems, nonconvex problems are common, while convex problems are rare. How to find the global minimum for nonconvex optimization and reduce the computational complexity are challenges. Inspired by the phenomenon that the stagewise stepsize tuning strategy can empirically improve the convergence speed in deep neural networks, we incorporate the stagewise stepsize tuning strategy into the iterative framework of Nesterov's acceleration- and variance reduction-based methods to reduce the computational complexity, i.e., the stagewise stepsize tuning strategy is incorporated into randomized stochastic accelerated gradient and stochastic variance-reduced gradient. The proposed methods are theoretically derived to reduce the complexity of the nonconvex and convex problems and improve the convergence rate of the frameworks, which have the complexity O(1/με) and O(L/με), respectively, where μ is the PL modulus and L is the Lipschitz constant. In the end, numerical experiments on large benchmark datasets validate the competitiveness of the proposed methods.
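The abstract refers to the PL modulus μ and the Lipschitz constant L without stating the conditions behind them. Under the usual conventions (an assumption here; the paper's normalization may differ by a constant factor), with f* denoting the minimum value of f, they read:

```latex
% Polyak-Lojasiewicz (PL) condition with modulus \mu > 0
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \,\bigl(f(x) - f^{*}\bigr)
\quad \text{for all } x,
% L-smoothness: the gradient is Lipschitz continuous with constant L
\qquad
\lVert \nabla f(x) - \nabla f(y) \rVert \;\le\; L \,\lVert x - y \rVert
\quad \text{for all } x, y.
```

The PL condition does not require convexity, which is why it is a common assumption when stating rates for nonconvex problems.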
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses the challenges of non-convex optimization in large-scale optimization problems. Specifically, it focuses on the following points:

1. **Reducing Computational Complexity**: In machine learning and deep neural networks, optimization problems are often non-convex, which makes finding the global minimum very difficult. Traditional Stochastic Gradient Descent (SGD) methods have low per-iteration cost but converge slowly, especially on large-scale datasets.
2. **Improving Convergence Speed**: To accelerate convergence on non-convex problems, the paper introduces a stagewise stepsize tuning strategy (SSTS). By combining it with Nesterov acceleration and variance reduction, the paper proposes two new optimization algorithms: Stagewise Randomized Stochastic Accelerated Gradient (S-RSAG) and Stagewise Stochastic Variance-Reduced Gradient (S-SVRG).
3. **Theoretical Analysis and Experimental Validation**: The paper derives complexities of O(1/µϵ) and O(L/µϵ) for the proposed algorithms on non-convex and convex optimization problems, respectively. It also validates the algorithms experimentally on multiple benchmark datasets, demonstrating their effectiveness and competitiveness.

### Main Contributions of the Paper

1. **Stagewise Randomized Stochastic Accelerated Gradient (S-RSAG)**:
   - For non-convex optimization problems, the iteration complexity of S-RSAG is O(L/µϵ), significantly lower than the O(L^2/ϵ + L/ϵ^2) of the non-stagewise RSAG.
   - For convex optimization problems, the iteration complexity of S-RSAG is O(1/µϵ), also better than the O(L/√ϵ + 1/ϵ^2) of the non-stagewise RSAG.
2. **Stagewise Stochastic Variance-Reduced Gradient (S-SVRG)**:
   - For non-convex optimization problems, the iteration complexity of S-SVRG is O(Lm/(µ√ϵ)), significantly better than that of the non-stagewise SVRG.
   - For convex optimization problems, the iteration complexity of S-SVRG is O(Lm/(µ^2√ϵ)), also better than that of the non-stagewise SVRG.
3. **Experimental Results**:
   - Experiments on datasets such as MNIST, CIFAR-10, REAL-SIM, and RCV1 show that S-RSAG and S-SVRG perform better in terms of loss value, training accuracy, and testing accuracy.

### Summary

By introducing a stagewise stepsize tuning strategy and combining it with Nesterov acceleration and variance reduction, this paper proposes two new optimization algorithms, S-RSAG and S-SVRG. Both show significant improvements in theory and in experiments, particularly on large-scale non-convex optimization problems, where they reduce computational complexity and improve convergence speed.
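The stagewise stepsize idea described above can be illustrated with a minimal Python sketch. This is not the paper's S-RSAG or S-SVRG: it applies a generic stagewise schedule to plain SGD, and the halving/doubling rule, the least-squares objective, the function name `stagewise_sgd`, and all parameter values are assumptions chosen for illustration. Each stage runs SGD with a constant stepsize and starts from the previous stage's output; between stages the stepsize is halved and the stage length is doubled.

```python
import numpy as np


def stagewise_sgd(A, b, eta0=0.02, T0=200, stages=5, seed=0):
    """Plain SGD with a stagewise stepsize schedule (illustrative helper).

    Minimizes f(x) = (1 / 2n) * ||A x - b||^2. The schedule (halve the
    stepsize, double the stage length) is an assumed, generic choice and
    is not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    eta, T = eta0, T0
    history = []  # full objective value at the end of each stage
    for _ in range(stages):
        # One stage: T SGD steps with a constant stepsize eta.
        for _ in range(T):
            i = rng.integers(n)                  # sample one data point
            grad_i = (A[i] @ x - b[i]) * A[i]    # gradient of the i-th term
            x = x - eta * grad_i
        history.append(0.5 * np.mean((A @ x - b) ** 2))
        # Stagewise update: smaller stepsize, longer next stage.
        eta *= 0.5
        T *= 2
    return x, history


# Synthetic least-squares problem (hypothetical data, for demonstration only).
data_rng = np.random.default_rng(1)
A = data_rng.standard_normal((500, 10))
x_true = data_rng.standard_normal(10)
b = A @ x_true + 0.1 * data_rng.standard_normal(500)

f0 = 0.5 * np.mean(b ** 2)  # objective at the starting point x = 0
x_hat, history = stagewise_sgd(A, b)
print("initial objective:", f0)
print("objective after each stage:", history)
```

Restarting each stage from the previous stage's output with a smaller stepsize lets early stages make fast progress with a large stepsize while later stages reduce the noise floor caused by stochastic gradients.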