Abstract: Stochastic Gradient Descent (SGD) with adaptive steps is now widely used for training deep neural networks. Most theoretical results assume access to unbiased gradient estimators, which is not the case in several recent deep learning and reinforcement learning applications that use Monte Carlo methods. This paper provides a comprehensive non-asymptotic analysis of SGD with biased gradients and adaptive steps for convex and non-convex smooth functions. Our study incorporates time-dependent bias and emphasizes the importance of controlling the bias and Mean Squared Error (MSE) of the gradient estimator. In particular, we establish that Adagrad and RMSProp with biased gradients converge to critical points for smooth non-convex functions at a rate similar to existing results in the literature for the unbiased case. Finally, we provide experimental results using Variational Autoencoders (VAE) that illustrate our convergence results and show how the effect of bias can be reduced by appropriate hyperparameter tuning.

What problem does this paper attempt to address?

This paper addresses how to obtain a non-asymptotic convergence analysis of Stochastic Gradient Descent (SGD) when the gradient estimator is biased and the step size is adaptive. Specifically, it studies how SGD behaves, and how fast it converges, in deep learning and reinforcement learning applications where Monte Carlo methods make the gradient estimator biased.

### Background and Motivation

- **Limitations of Standard SGD**: The traditional SGD analysis assumes that the gradient estimator is unbiased, that is, its expected value equals the true gradient. In many practical settings, such as zeroth-order optimization methods, Generative Adversarial Networks (GAN), Q-learning, and policy gradients, the gradient estimator is often biased.
- **Importance of Adaptive Step Size**: Modern SGD variants (such as Adagrad and RMSProp) use adaptive step sizes to improve performance, but existing theoretical analyses usually assume that the gradient estimator is unbiased.

### Research Objectives

- **Non-Asymptotic Convergence Analysis**: The paper aims to provide a non-asymptotic convergence analysis of SGD with a biased gradient estimator and an adaptive step size, for both convex and non-convex smooth functions.
- **Controlling Bias and Mean Squared Error**: The analysis emphasizes the importance of controlling the bias and the mean squared error (MSE) of the gradient estimator, and states its convergence results in terms of these quantities.

### Main Contributions

1. **Theoretical Results**:
   - For non-convex smooth functions, the paper proves that Adagrad and RMSProp with biased gradients converge at a rate similar to the unbiased case in the existing literature, specifically \( O\left(\frac{\log n}{\sqrt{n}}+b_n\right) \), where \( b_n \) is a bias term that depends on the iteration count \( n \).
   - For convex functions, the paper obtains an improved convergence rate of \( O\left(\frac{1}{\sqrt{n}}+b_n\right) \).
2. **Hyperparameter Tuning**:
   - Shows how appropriate hyperparameter choices reduce the effect of the bias term \( b_n \), which improves the convergence rate.
3. **Experimental Verification**:
   - Reports experiments with Variational Autoencoders (VAE) that illustrate the theoretical results and show how hyperparameter tuning reduces the impact of bias.

### Conclusion

Through theoretical analysis and experiments, the paper shows that SGD with a biased gradient estimator and an adaptive step size converges, provided the bias and MSE of the estimator are controlled. These results help explain the behavior of optimization algorithms in modern deep learning and reinforcement learning.
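
To make the role of the bias term \( b_n \) concrete, here is a minimal toy sketch rather than the paper's VAE experiment. It runs coordinate-wise Adagrad on the quadratic \( f(x) = \frac{1}{2}\|x\|^2 \), whose true gradient is \( x \), and corrupts each gradient estimate with Gaussian noise plus a bias \( b_n \). The objective, the function names (`adagrad_biased`, `constant_bias`, `decaying_bias`), the bias schedules, and all hyperparameter values are assumptions chosen for illustration.

```python
import math
import random


def adagrad_biased(bias_schedule, n_steps=5000, dim=5, lr=0.5,
                   noise_std=0.1, eps=1e-8, seed=0):
    """Run coordinate-wise Adagrad on f(x) = 0.5 * ||x||^2 with a biased,
    noisy gradient estimate and return the final true gradient norm."""
    rng = random.Random(seed)
    x = [1.0] * dim
    accum = [0.0] * dim  # running sum of squared gradient estimates
    for n in range(n_steps):
        b_n = bias_schedule(n)  # time-dependent bias (illustrative)
        for i in range(dim):
            # True gradient of f is x; the estimator adds bias and noise.
            g = x[i] + b_n + rng.gauss(0.0, noise_std)
            accum[i] += g * g
            # Adagrad step: scale by the root of the accumulated squares.
            x[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return math.sqrt(sum(v * v for v in x))  # ||grad f(x)|| = ||x||


def constant_bias(n):
    # Bias that never vanishes: the iterates settle away from the minimizer.
    return 0.5


def decaying_bias(n):
    # Bias that shrinks with the iteration count n.
    return 0.5 / math.sqrt(n + 1)


if __name__ == "__main__":
    print("constant bias:", adagrad_biased(constant_bias))
    print("decaying bias:", adagrad_biased(decaying_bias))
```

In this toy setting, a constant bias leaves the true gradient norm stuck near a nonzero floor, while a bias that decays with \( n \) lets it shrink toward the noise level. This is consistent with a bound of the form "rate term plus \( b_n \)", but it is only an illustration and does not reproduce the paper's assumptions or constants.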

Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation
