Abstract: Entropy regularization is an efficient technique for encouraging exploration and preventing premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy-regularized RL algorithms has been limited. In this paper, we revisit the classical entropy-regularized policy gradient methods with the soft-max policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this scenario, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization, one being an unbiased visitation measure-based estimator and the other a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general due to the additional logarithmic policy rewards introduced by the entropy term, their variances are uniformly bounded. We then propose a two-phase stochastic policy gradient (PG) algorithm that uses a large batch size in the first phase to overcome the challenge of the stochastic approximation due to the non-coercive landscape, and uses a small batch size in the second phase by leveraging the curvature information around the optimal policy. We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm. These are the first global convergence and sample complexity results for the stochastic entropy-regularized vanilla PG method.
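
For orientation, the objects named in the abstract can be written out in a standard form. The notation here is an assumption, not taken from the paper: $\rho$ is the initial state distribution, $\gamma \in [0,1)$ the discount factor, $\tau > 0$ the regularization weight, and $\theta \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ the soft-max parameters. The term $-\tau \log \pi_\theta(a_t \mid s_t)$ is the "logarithmic policy reward"; it grows without bound as $\pi_\theta(a_t \mid s_t) \to 0$, which is why estimators built on it are unbounded in general.

$$
\pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})},
\qquad
V_\tau^{\pi_\theta}(\rho) = \mathbb{E}_{s_0 \sim \rho,\; a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\sum_{t=0}^{\infty} \gamma^t \bigl(r(s_t, a_t) - \tau \log \pi_\theta(a_t \mid s_t)\bigr)\right]
$$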

What problem does this paper attempt to address?

The main problem this paper addresses is how to guarantee, in reinforcement learning (RL), that the entropy-regularized policy gradient (ERPG) method converges globally with low sample complexity when only stochastic gradient estimates are available. Specifically, the paper asks whether the method can retain the geometric advantages it has in the exact gradient setting and still converge to the globally optimal solution in the stochastic gradient setting.

### Main Problems

1. **Global Optimal Convergence**: How to prove that the entropy-regularized policy gradient method converges to the global optimum in the stochastic gradient setting.
2. **Sample Complexity**: How to keep the sample complexity of the method low in the stochastic gradient setting, with the specific goal of achieving $\tilde{O}(1/\epsilon^2)$.

### Background

- **Entropy Regularization**: A commonly used technique for encouraging exploration and preventing premature convergence. It works by adding an entropy term to the objective function.
- **Exact Gradient Setting**: With exact gradients, the entropy-regularized policy gradient method is known to converge linearly to the global optimum.
- **Stochastic Gradient Setting**: In practice, usually only a stochastic estimate of the policy gradient is available rather than the exact gradient, so the method's behavior in this setting needs to be studied.

### Paper Contributions

1. **New Stochastic Gradient Estimators**:
   - An unbiased visitation measure-based stochastic gradient estimator.
   - A nearly unbiased but more practical trajectory-based stochastic gradient estimator.
   - A proof that, although these estimators are unbounded in general, their variances are uniformly bounded.
2. **Two-Phase Stochastic Policy Gradient Algorithm**:
   - The first phase uses a large batch size to overcome the challenges posed by the non-coercive landscape.
   - The second phase exploits curvature information near the optimal policy and uses a small batch size.
   - The algorithm is proved to converge to the global optimum with a sample complexity of $\tilde{O}(1/\epsilon^2)$.
3. **Theoretical Analysis**:
   - Global convergence and sample complexity are established for the entropy-regularized policy gradient method in the stochastic gradient setting.
   - This is the first such analysis for the basic entropy-regularized policy gradient methods (such as REINFORCE and its variants) in the stochastic gradient setting.

### Related Work

- **Exploration and Robustness**: Entropy regularization is reported to improve exploration and robustness.
- **Other Methods**: Other methods, such as Actor-Critic, Q-learning, and Trust-Region Policy Optimization, also use entropy regularization.
- **Existing Results**: Existing theoretical results mainly concern the exact gradient setting; relatively few address global convergence and sample complexity in the stochastic gradient setting.

### Conclusion

By proposing new stochastic gradient estimators and a two-phase algorithm, the paper establishes global convergence and a sample complexity bound for the entropy-regularized policy gradient method in the stochastic gradient setting, filling a gap in the existing theory.
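
The trajectory-based estimator and the large-batch-then-small-batch schedule summarized above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it is a plain REINFORCE-style estimator on a tabular soft-max policy in which each reward is augmented by the logarithmic policy reward `-tau * log pi(a|s)`, truncated at a fixed horizon (the truncation is what makes such an estimator only nearly unbiased). The function names, the MDP arrays `P`, `R`, `rho`, and every hyperparameter (step size, batch sizes, iteration counts, horizon) are hypothetical choices made for illustration.

```python
import numpy as np


def softmax_policy(theta):
    """Row-wise soft-max over actions; theta has shape (n_states, n_actions)."""
    z = theta - theta.max(axis=1, keepdims=True)  # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def trajectory_gradient(theta, P, R, rho, gamma, tau, horizon, rng):
    """Single-trajectory gradient estimate with entropy-augmented rewards.

    Truncating at `horizon` introduces a small bias (illustrative choice).
    P has shape (n_states, n_actions, n_states); R has shape (n_states, n_actions).
    """
    n_states, n_actions = theta.shape
    pi = softmax_policy(theta)
    s = rng.choice(n_states, p=rho)
    scores, soft_rewards = [], []
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        # Score of the tabular soft-max: grad log pi(a|s) = e_a - pi(.|s) in row s.
        score = np.zeros_like(theta)
        score[s] = -pi[s]
        score[s, a] += 1.0
        scores.append(score)
        # Discounted reward plus the logarithmic policy reward from the entropy term.
        soft_rewards.append(gamma**t * (R[s, a] - tau * np.log(pi[s, a])))
        s = rng.choice(n_states, p=P[s, a])
    # Discounted soft reward-to-go from each step t onward.
    to_go = np.cumsum(soft_rewards[::-1])[::-1]
    grad = np.zeros_like(theta)
    for score, g in zip(scores, to_go):
        grad += g * score
    return grad


def two_phase_pg(theta, P, R, rho, *, gamma=0.9, tau=0.1, horizon=30,
                 step=0.1, phase1=(50, 64), phase2=(200, 8), seed=0):
    """Stochastic gradient ascent with a large batch first, then a small batch.

    Each phase is an (iterations, batch_size) pair; the values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    for n_iters, batch in (phase1, phase2):
        for _ in range(n_iters):
            grads = [trajectory_gradient(theta, P, R, rho, gamma, tau, horizon, rng)
                     for _ in range(batch)]
            theta = theta + step * np.mean(grads, axis=0)
    return theta
```

As a usage example, a one-state, two-action problem can be set up with `P = np.ones((1, 2, 1))`, `R = np.array([[1.0, 0.0]])`, and `rho = np.array([1.0])`; calling `two_phase_pg(np.zeros((1, 2)), P, R, rho)` then returns parameters whose soft-max policy can be inspected with `softmax_policy`. The sketch omits baselines and any of the step-size or batch-size rules that a formal analysis would prescribe.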

Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization

Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality

Convergence of Policy Gradient for Entropy Regularized MDPs with Neural Network Approximation in the Mean-Field Regime

Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation

Linear Convergence of Independent Natural Policy Gradient in Games with Entropy Regularization

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Generalization Bounds for Stochastic Gradient Langevin Dynamics: A Unified View Via Information Leakage Analysis

Entropy annealing for policy mirror descent in continuous time and space

Stochastic Cubic-Regularized Policy Gradient Method

Approximate Newton policy gradient algorithms

Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Essentially Sharp Estimates on the Entropy Regularization Error in Discrete Discounted Markov Decision Processes

Elementary Analysis of Policy Gradient Methods

Increasing Entropy to Boost Policy Gradient Performance on Personalization Tasks

Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies

Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games

Stochastic Convergence Results for Regularized Actor-Critic Methods

Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning