Non asymptotic analysis of Adaptive stochastic gradient algorithms and applications

Antoine Godichon-Baggioni, Pierre Tarrago
2023-03-01
Abstract: In stochastic optimization, a common tool for dealing sequentially with large samples is the well-known stochastic gradient algorithm. Nevertheless, since the step sequence is the same for each direction, this can lead to bad results in practice for ill-conditioned problems. To overcome this, adaptive gradient algorithms such as Adagrad or Stochastic Newton algorithms should be preferred. This paper is devoted to the non-asymptotic analysis of these adaptive gradient algorithms for strongly convex objectives. All the theoretical results are adapted to linear regression and regularized generalized linear models for both Adagrad and Stochastic Newton algorithms.
Optimization and Control, Probability, Statistics Theory, Machine Learning
What problem does this paper attempt to address?
The paper studies the non-asymptotic convergence rate of adaptive stochastic gradient algorithms for strongly convex objective functions. Specifically, the authors focus on how adaptive gradient algorithms (such as Adagrad and stochastic Newton algorithms) improve on the standard stochastic gradient descent algorithm for ill-conditioned optimization problems.

### Background and Problem Description

In stochastic optimization, a common tool for handling large-scale data sets is the stochastic gradient algorithm. However, since the step-size sequence is the same for each direction, the algorithm can perform poorly in practice, especially on ill-conditioned problems. Adaptive gradient algorithms (such as Adagrad and stochastic Newton algorithms) were proposed to overcome this: they adjust the step size separately for different directions of the gradient.

### Research Objectives

The main objectives of this paper are:

1. **Non-Asymptotic Analysis**: To study the non-asymptotic convergence rate of adaptive stochastic gradient algorithms for strongly convex objective functions.
2. **Theoretical Results**: To provide theoretical results for Adagrad and stochastic Newton algorithms in linear regression and regularized generalized linear models.
3. **Practical Applications**: To apply the theoretical results to specific models and verify the effectiveness of the algorithms.

### Main Contributions

1. **Non-Asymptotic Convergence Rate**: Gives a first convergence rate that holds even when the adaptive estimates may diverge, provided the divergence is controlled.
2. **General Framework**: Establishes an unconstrained general framework for obtaining the convergence rates of stochastic Newton and Adagrad algorithms.
3. **Specific Applications**: Applies the theoretical results to linear regression and ridge generalized linear models, with a detailed convergence rate analysis.

### Methods and Techniques

- **Adaptive Gradient Algorithm**: Adjusts the step size in each coordinate direction by introducing a sequence \((A_n)\), where \(A_n\) is a random matrix.
- **Assumptions**: Introduces assumptions such as \((H1)\) and \((H2)\) to control the minimum and maximum eigenvalues of \(A_n\) and to ensure that it has uniform second- and fourth-order moments.
- **Convergence Analysis**: Uses tools such as Taylor expansion and conditional expectation to derive the non-asymptotic convergence rate of the algorithm.

### Conclusions

- **Convergence Rate**: Under the stated assumptions, adaptive stochastic gradient algorithms admit an explicit non-asymptotic convergence rate for strongly convex objective functions.
- **Practical Effects**: In specific applications, such as linear regression and generalized linear models, the adaptive algorithms perform better than the standard stochastic gradient descent algorithm.

### Formula Examples

- **Objective Function**:
\[ G(h)=\mathbb{E}[g(X, h)] \]
- **Adaptive Gradient Update Rule**:
\[ \theta_{n + 1}=\theta_n-\gamma_{n + 1}A_n\nabla_hg(X_{n + 1},\theta_n) \]
- **Non-Asymptotic Convergence Rate**:
\[ \mathbb{E}[V_n]\leq\exp\left(-c_\gamma\mu\lambda_0n^{1 - (\lambda+\gamma)}(1-\epsilon(n))\right)\left(K_1^{(1)}+K_1'^{(1)}\max_{1\leq k\leq n + 1}k^{\gamma-2\beta-\delta/2-(q/2 + 1)\lambda}\sqrt{v_k}\right)+K_2^{(1)}n^{-(\gamma-2\beta-\lambda)}+K_3^{(1)}\sqrt{v_{\lfloor n/2\rfloor}}n^{-(\delta+q\lambda)/2} \]

In summary, the paper derives non-asymptotic convergence rates for adaptive stochastic gradient algorithms with strongly convex objective functions and applies them to linear regression and regularized generalized linear models.
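To make the adaptive update rule above concrete, here is a minimal Python sketch for linear regression. It is an illustration under stated assumptions, not the authors' implementation: the loss \(g((x, y), h) = \tfrac{1}{2}(y - x^\top h)^2\), the step sequence \(\gamma_n = c_\gamma n^{-\gamma}\), the diagonal Adagrad-style choice of \(A_n\) (inverse root of the running mean of squared gradient coordinates, initialised so that \(A_0\) is the identity), and the names `make_ill_conditioned_data` and `adaptive_sgd` are all hypothetical choices made for this example.

```python
import numpy as np


def make_ill_conditioned_data(n_samples, scales, theta_star, noise_std, seed):
    """Simulate y = x^T theta_star + noise, with features of very different scales."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, len(scales))) * np.asarray(scales)
    Y = X @ theta_star + noise_std * rng.normal(size=n_samples)
    return X, Y


def adaptive_sgd(X, Y, c_gamma=1.0, gamma=0.75, adaptive=True):
    """One pass of theta_{n+1} = theta_n - gamma_{n+1} * A_n * grad g(X_{n+1}, theta_n).

    With adaptive=False, A_n is the identity (plain stochastic gradient).
    With adaptive=True, A_n is an assumed diagonal Adagrad-style matrix.
    """
    n_samples, d = X.shape
    theta = np.zeros(d)
    sq_grad_sum = np.ones(d)  # starts at 1 so that A_0 is the identity
    for n in range(n_samples):
        x, y = X[n], Y[n]
        # Gradient of 0.5 * (y - x^T h)^2 with respect to h, evaluated at theta_n.
        grad = (x @ theta - y) * x
        step = c_gamma * (n + 1) ** (-gamma)
        if adaptive:
            # A_n uses only gradients observed before the current sample.
            a_diag = 1.0 / np.sqrt(sq_grad_sum / (n + 1))
        else:
            a_diag = np.ones(d)
        theta = theta - step * a_diag * grad
        sq_grad_sum += grad ** 2
    return theta


# Ill-conditioned example: the second feature is ten times smaller than the first.
theta_star = np.array([1.0, -2.0])
X, Y = make_ill_conditioned_data(20000, [1.0, 0.1], theta_star, 0.1, seed=0)

err_sgd = np.linalg.norm(adaptive_sgd(X, Y, adaptive=False) - theta_star)
err_adagrad = np.linalg.norm(adaptive_sgd(X, Y, adaptive=True) - theta_star)
print(f"plain SGD error:     {err_sgd:.4f}")
print(f"Adagrad-style error: {err_adagrad:.4f}")
```

In this setup the plain algorithm uses the same step in both directions, so the coordinate attached to the small-scale feature moves slowly, while the per-coordinate scaling in `a_diag` compensates for the difference in gradient magnitudes. A stochastic Newton variant would replace `a_diag` with an estimate of the inverse Hessian; that is not sketched here.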