On the Convergence of Discounted Policy Gradient Methods

Chris Nota
DOI: https://doi.org/10.48550/arXiv.2212.14066
2023-01-09
Abstract: Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the convergence of discounted policy gradient methods. Specifically, it asks whether the discounted policy gradient methods widely used in reinforcement learning can still converge, even though the discounted approximation they follow is not the gradient of any objective function.

### Main Problems

1. **Properties of the Discounted Approximation**: Discounted policy gradient methods update the policy parameters along a direction known as the "discounted approximation". This direction is a biased approximation of the policy gradient and is not the gradient of any objective function.
2. **Convergence Guarantee**: Despite this bias, can a policy gradient method based on the discounted approximation still converge if the discount factor and the learning rate are adjusted appropriately?

### Research Motivation

- **Bias-Variance Trade-off**: Although the discounted approximation introduces bias, it can significantly reduce the variance of the estimate. It is therefore important to understand this bias and its effect on the convergence of the algorithm.
- **Theoretical Basis**: The existing literature contains relatively little theoretical analysis of the discounted approximation, and its convergence behavior in particular has received little study. This paper aims to fill that gap with a rigorous proof.

### Solutions

The main contributions of the paper include:

1. **Quantification of Bias**: The author shows that the bias of the discounted approximation can be characterized exactly and that its magnitude depends on the discount factor \( \gamma \). Specifically, the upper bound on the bias is proportional to \( 1-\gamma \).
2. **Convergence Conditions**: By applying standard convergence results for gradient methods with errors, the author shows that if the discount factor \( \gamma \) is increased slowly, at a rate related to the decay of the learning rate, the policy gradient method based on the discounted approximation recovers the convergence guarantees of standard gradient ascent on the undiscounted objective and converges to a locally optimal policy.

### Formula Summary

- **Direction of the Discounted Approximation**: \[ \hat{\nabla}(\theta, \gamma) := \mathbb{E}\left[\sum_{t = 0}^{T - 1} Q^{\pi_\theta}_\gamma(S_t, A_t) \frac{\partial}{\partial \theta} \ln \pi_\theta(S_t, A_t)\mid\pi=\pi_\theta\right] \]
- **Upper Bound of Bias**: \[ \left\|\sum_{s\in S} V^{\pi_\theta}_\gamma(s) \frac{\partial}{\partial \theta} d^{\pi_\theta}_\gamma(s)\right\| \leq (1 - \gamma)L_e \] where \( L_e \) is a constant.

### Conclusion

The paper proves through rigorous mathematical derivation that, under appropriate conditions on the discount factor and the learning rate, discounted policy gradient methods recover the standard guarantees of gradient ascent on the undiscounted objective and converge to a locally optimal policy. This provides a theoretical basis for algorithm design in practical applications.
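
### Illustrative Schedule

The convergence condition summarized above couples the discount factor to the learning rate. The sketch below shows one way such a coupling could look. It is an assumption made for illustration, not the schedule stated in the paper: the rule \( 1 - \gamma_k = c\,\alpha_k \), the step size \( \alpha_k = \alpha_0 / (k + 1) \), the constants `c`, `alpha0`, and `L_e`, and all function names are hypothetical.

```python
def learning_rate(k, alpha0=0.5):
    """Hypothetical decreasing step size: alpha_k = alpha0 / (k + 1)."""
    return alpha0 / (k + 1)


def discount_factor(k, c=1.0, alpha0=0.5):
    """Hypothetical coupling: 1 - gamma_k = c * alpha_k.

    As the learning rate decays, gamma_k rises toward 1.
    Requires c * alpha0 <= 1 so that gamma_k stays in [0, 1).
    """
    return 1.0 - c * learning_rate(k, alpha0)


def bias_bound(gamma, L_e=1.0):
    """Bias bound from the summary: (1 - gamma) * L_e (L_e is an assumed constant)."""
    return (1.0 - gamma) * L_e


if __name__ == "__main__":
    for k in (0, 9, 99, 999):
        gamma = discount_factor(k)
        print(f"k={k:4d}  alpha={learning_rate(k):.4f}  "
              f"gamma={gamma:.4f}  bias_bound={bias_bound(gamma):.4f}")
```

Under this assumed rule the bias bound \( (1 - \gamma_k) L_e \) shrinks at the same rate as the learning rate, which is the kind of relationship the summary describes. Whether a given schedule satisfies the paper's exact conditions has to be checked against the paper itself.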