Abstract: Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.

What problem does this paper attempt to address?

This paper addresses the convergence of discounted policy gradient methods. Specifically, it asks whether the widely used discounted policy gradient methods in reinforcement learning can still converge to a locally optimal policy, even though the discounted approximation they follow is not the gradient of any objective function.

### Main Problems

1. **Properties of the Discounted Approximation**: Discounted policy gradient methods update the policy parameters along a direction called the "discounted approximation". This direction is not the gradient of any objective function, and it introduces bias.
2. **Convergence Guarantee**: Despite this bias, can a policy gradient method based on the discounted approximation still converge to a locally optimal policy if the discount factor and the learning rate are adjusted appropriately?

### Research Motivation

- **Bias-Variance Trade-off**: Although the discounted approximation introduces bias, it can significantly reduce the variance of the gradient estimate. It is therefore important to understand this bias and how it affects the convergence of the algorithm.
- **Theoretical Basis**: The existing literature contains relatively little theoretical analysis of the discounted approximation, and in particular little on its convergence behavior. This paper aims to fill that gap with a rigorous proof.

### Solutions

The main contributions of the paper include:

1. **Quantification of Bias**: The author shows that the bias of the discounted approximation can be expressed explicitly and that its magnitude depends on the discount factor \( \gamma \). Specifically, the upper bound on the bias is proportional to \( 1-\gamma \).
2. **Convergence Conditions**: By applying standard convergence results for gradient methods with errors, the author shows that if the discount factor \( \gamma \) is increased slowly, at a rate related to the decay of the learning rate, the policy gradient method based on the discounted approximation recovers the convergence guarantees of standard gradient ascent on the undiscounted objective and converges to a locally optimal policy.

### Formula Summary

- **Direction of the Discounted Approximation**:
  \[ \hat{\nabla}(\theta, \gamma) := \mathbb{E}\left[\sum_{t = 0}^{T - 1} Q^{\pi_\theta}_\gamma(S_t, A_t) \frac{\partial}{\partial \theta} \ln \pi_\theta(S_t, A_t) \,\middle|\, \pi=\pi_\theta\right] \]
- **Upper Bound on the Bias**:
  \[ \left\|\sum_{s\in S} V^{\pi_\theta}_\gamma(s) \frac{\partial}{\partial \theta} d^{\pi_\theta}_\gamma(s)\right\| \leq (1 - \gamma)L_e \]
  where \( L_e \) is a constant.

### Conclusion

Through a rigorous mathematical derivation, the paper proves that, under appropriate conditions on the discount factor and the learning rate, discounted policy gradient methods recover the standard convergence guarantees of gradient ascent on the undiscounted objective. This gives a theoretical basis for algorithm design in practice.
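
To make the coupling between the discount factor and the learning rate concrete, the following is a minimal sketch rather than the paper's actual schedule. It assumes a polynomially decaying step size \( \alpha_k = \alpha_0 / (k+1)^{p} \) with \( 1/2 < p \le 1 \), and sets \( 1 - \gamma_k = c\,\alpha_k \), so that the bias bound \( (1-\gamma_k)L_e \) shrinks at the same rate as the step size. The function names, the constants `alpha0`, `power`, and `c`, and the value `L_e = 1.0` are all hypothetical choices made for illustration.

```python
def learning_rate(k, alpha0=0.5, power=0.75):
    """Assumed step size alpha_k = alpha0 / (k + 1)**power.

    With 1/2 < power <= 1 the step sizes sum to infinity while their
    squares have a finite sum (the usual stochastic approximation setup).
    """
    return alpha0 / (k + 1) ** power


def discount_factor(k, c=1.0):
    """Assumed coupling gamma_k = 1 - c * alpha_k (illustrative, not from the paper).

    The result is clipped to [0, 1] so that a large c cannot produce an
    invalid discount factor in early iterations.
    """
    gamma = 1.0 - c * learning_rate(k)
    return min(max(gamma, 0.0), 1.0)


def bias_bound(gamma, L_e=1.0):
    """Upper bound (1 - gamma) * L_e on the bias; L_e = 1.0 is a placeholder."""
    return (1.0 - gamma) * L_e


def schedule(num_steps, c=1.0, L_e=1.0):
    """Return (k, alpha_k, gamma_k, bias bound) for k = 0, ..., num_steps - 1."""
    rows = []
    for k in range(num_steps):
        alpha = learning_rate(k)
        gamma = discount_factor(k, c=c)
        rows.append((k, alpha, gamma, bias_bound(gamma, L_e)))
    return rows


if __name__ == "__main__":
    # Print a few iterations to show gamma_k rising toward 1 as alpha_k decays.
    for k, alpha, gamma, bound in schedule(10001)[::2500]:
        print(f"k={k:5d}  alpha={alpha:.5f}  gamma={gamma:.5f}  bias<={bound:.5f}")
```

Under this assumed coupling, the bias bound at step \( k \) equals \( c L_e \alpha_k \), so the accumulated weighted error \( \sum_k \alpha_k (1-\gamma_k) L_e = c L_e \sum_k \alpha_k^2 \) is finite. That is the kind of condition under which gradient methods with errors are typically shown to converge; the exact rate condition used in the paper may differ.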

On the Convergence of Discounted Policy Gradient Methods

Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Is the Policy Gradient a Gradient?

On the Second-Order Convergence of Biased Policy Gradient Algorithms

Elementary Analysis of Policy Gradient Methods

A nearly Blackwell-optimal policy gradient method

Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning

Convergence Rates of Accelerated Markov Gradient Descent with Applications in Reinforcement Learning

Strongly-polynomial time and validation analysis of policy gradient methods

A Temporal-Difference Approach to Policy Gradient Estimation

Theoretical Guarantees of Fictitious Discount Algorithms for Episodic Reinforcement Learning and Global Convergence of Policy Gradient Methods

On the Convergence of Projected Policy Gradient for Any Constant Step Sizes

Convergence of Policy Gradient for Stochastic Linear-Quadratic Control Problem in Infinite Horizon

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Correcting discount-factor mismatch in on-policy policy gradient methods

Deterministic Policy Gradients with General State Transitions

Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate

Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization