Abstract: Stochastic Gradient Descent (SGD) with adaptive steps is now widely used for training deep neural networks. Most theoretical results assume access to unbiased gradient estimators, which is not the case in several recent deep learning and reinforcement learning applications that use Monte Carlo methods. This paper provides a comprehensive non-asymptotic analysis of SGD with biased gradients and adaptive steps for convex and non-convex smooth functions. Our study incorporates time-dependent bias and emphasizes the importance of controlling the bias and Mean Squared Error (MSE) of the gradient estimator. In particular, we establish that Adagrad and RMSProp with biased gradients converge to critical points for smooth non-convex functions at a rate similar to existing results in the literature for the unbiased case. Finally, we provide experimental results using Variational Autoencoders (VAE) that illustrate our convergence results and show how the effect of bias can be reduced by appropriate hyperparameter tuning.
What problem does this paper attempt to address?
This paper addresses how to obtain non-asymptotic convergence guarantees for Stochastic Gradient Descent (SGD) when the gradient estimator is biased and the step size is adaptive. It targets deep learning and reinforcement learning applications in which the use of Monte Carlo methods makes the gradient estimator biased, and it asks whether SGD still converges in that setting and at what rate.
### Background and Motivation
- **Limitations of Standard SGD**: The traditional analysis of SGD assumes that the gradient estimator is unbiased, that is, its expected value equals the true gradient. In many practical settings, such as zeroth-order optimization, Generative Adversarial Networks (GAN), Q-learning, and policy gradient methods, the gradient estimator is often biased.
- **Importance of Adaptive Step Size**: Modern SGD variants such as Adagrad and RMSProp use adaptive step sizes to improve performance, but existing theoretical analyses of these methods usually assume an unbiased gradient estimator.
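
To make the setting concrete, the following is a minimal sketch of a scalar Adagrad-style update driven by a biased gradient oracle. It is an illustration under stated assumptions, not the paper's algorithm or experimental setup: the objective `f(x) = 0.5 * x**2`, the bias schedule `bias_scale / sqrt(n)`, the noise level, and the names `biased_grad` and `adagrad` are all hypothetical choices made for this example.

```python
import math
import random


def biased_grad(x, n, rng, bias_scale=1.0, noise=0.1):
    """Noisy, biased gradient estimate of f(x) = 0.5 * x**2 at iteration n.

    The true gradient is x. The added bias bias_scale / sqrt(n) is a
    hypothetical time-dependent bias that shrinks as n grows.
    """
    true_grad = x
    bias = bias_scale / math.sqrt(n)
    return true_grad + bias + rng.gauss(0.0, noise)


def adagrad(x0, steps, lr=0.5, eps=1e-8, seed=0):
    """Scalar Adagrad: the step is scaled by the root of the summed squared gradients."""
    rng = random.Random(seed)
    x = x0
    accum = 0.0
    for n in range(1, steps + 1):
        g = biased_grad(x, n, rng)
        accum += g * g                           # running sum of squared gradient estimates
        x -= lr * g / (math.sqrt(accum) + eps)   # adaptive step
    return x


x_short = adagrad(x0=5.0, steps=10)
x_final = adagrad(x0=5.0, steps=2000)
print(f"|x| after 10 steps:   {abs(x_short):.3f}")
print(f"|x| after 2000 steps: {abs(x_final):.3f}")
```

In this toy run the iterate still approaches the minimizer at 0 because the assumed bias decays over time; with a bias that does not decay, the iterate would instead settle near a point offset from the minimizer.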
### Research Objectives
- **Non-Asymptotic Convergence Analysis**: The paper aims to provide a non-asymptotic convergence analysis of SGD with a biased gradient estimator and an adaptive step size, for both convex and non-convex smooth functions.
- **Controlling Bias and Mean Squared Error**: The analysis emphasizes the importance of controlling the bias and the mean squared error (MSE) of the gradient estimator, and states its convergence results in terms of these two quantities.
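
For reference, the two quantities can be written as follows. The notation is an assumption made for this note (\( \theta \) for the current iterate, \( g \) for the gradient estimator, \( f \) for the objective), and the paper's exact definitions may differ. The second line is the standard bias-variance decomposition, which shows that bounding the MSE bounds both the variance and the bias.

\[
\mathrm{bias}(\theta) = \left\| \mathbb{E}\left[ g \mid \theta \right] - \nabla f(\theta) \right\|,
\qquad
\mathrm{MSE}(\theta) = \mathbb{E}\left[ \left\| g - \nabla f(\theta) \right\|^2 \,\middle|\, \theta \right]
\]
\[
\mathrm{MSE}(\theta) = \mathbb{E}\left[ \left\| g - \mathbb{E}\left[ g \mid \theta \right] \right\|^2 \,\middle|\, \theta \right] + \mathrm{bias}(\theta)^2
\]

An unbiased estimator is the special case \( \mathrm{bias}(\theta) = 0 \), in which the MSE reduces to the variance.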
### Main Contributions
1. **Theoretical Results**:
- For non-convex smooth functions, the paper proves that Adagrad and RMSProp with biased gradients converge at a rate similar to existing results for the unbiased case, specifically \( O\left(\frac{\log n}{\sqrt{n}}+b_n\right) \), where \( n \) is the number of iterations and \( b_n \) is a bias term that depends on \( n \).
- For convex functions, the paper obtains an improved convergence rate of \( O\left(\frac{1}{\sqrt{n}}+b_n\right) \).
2. **Hyperparameter Tuning**:
- The paper shows how appropriate hyperparameter tuning reduces the effect of the bias term, which in turn improves the convergence rate.
3. **Experimental Verification**:
- The paper reports experiments with Variational Autoencoders (VAE) that illustrate the theoretical results and show how hyperparameter tuning reduces the impact of bias.
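
The paper's estimators and tuning rules are not reproduced here. As a loosely related toy illustration of how one hyperparameter can shrink the bias of a Monte Carlo estimator, the sketch below estimates \( \log \mathbb{E}[e^{Z}] \) for \( Z \sim N(0, 1) \) by taking the log of an average of `k` samples. This estimator is biased downward by Jensen's inequality, and the bias shrinks as `k` grows. The choice of target, the sample counts, and the function names are assumptions made for this example.

```python
import math
import random


def log_mean_estimate(k, rng):
    """Log of the average of k draws of exp(Z), with Z ~ N(0, 1)."""
    total = sum(math.exp(rng.gauss(0.0, 1.0)) for _ in range(k))
    return math.log(total / k)


def empirical_bias(k, reps=4000, seed=0):
    """Average estimation error over reps independent replications."""
    rng = random.Random(seed)
    target = 0.5  # log E[exp(Z)] = 1/2 for Z ~ N(0, 1)
    avg = sum(log_mean_estimate(k, rng) for _ in range(reps)) / reps
    return avg - target


# Estimated bias for a few sample counts k (the hyperparameter in this toy example).
bias_by_k = {k: empirical_bias(k) for k in (1, 5, 50)}
for k, b in bias_by_k.items():
    print(f"k={k:3d}  estimated bias={b:+.3f}")
```

With `k = 1` the estimator reduces to \( Z \) itself, whose mean is 0, so the bias is about \( -0.5 \). Larger `k` moves the estimate toward the target at the price of more samples per estimate.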
### Conclusion
Through theoretical analysis and supporting experiments, the paper establishes convergence guarantees for SGD with a biased gradient estimator and an adaptive step size. These results help explain the behavior of the optimization algorithms used in modern deep learning and reinforcement learning.