Abstract: Entropy regularization is an efficient technique for encouraging exploration and preventing premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy-regularized RL algorithms has been limited. In this paper, we revisit the classical entropy-regularized policy gradient methods with the soft-max policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this scenario, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization: an unbiased visitation measure-based estimator and a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general due to the additional logarithmic policy rewards introduced by the entropy term, their variances are uniformly bounded. We then propose a two-phase stochastic policy gradient (PG) algorithm that uses a large batch size in the first phase to overcome the challenge of stochastic approximation due to the non-coercive landscape, and a small batch size in the second phase by leveraging the curvature information around the optimal policy. We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm. These are the first global convergence and sample complexity results for the stochastic entropy-regularized vanilla PG method.
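To make the ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a small tabular MDP, a soft-max policy with one parameter per state-action pair, a REINFORCE-style trajectory estimator truncated at a fixed horizon (so it is only nearly unbiased), and a two-phase loop that switches from a large to a small batch size. All function names, the fixed truncation horizon, the step size, and the batch sizes are hypothetical choices made for illustration; the paper's estimators and parameter schedules may differ.

```python
import numpy as np

def softmax_policy(theta):
    # theta has shape (S, A); each row is mapped to a distribution over actions.
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def sample_gradient(theta, P, R, rho, gamma, tau, horizon, rng):
    """One truncated-trajectory estimate of the entropy-regularized policy gradient."""
    S, A = theta.shape
    pi = softmax_policy(theta)
    s = rng.choice(S, p=rho)
    states, actions, soft_rewards = [], [], []
    for t in range(horizon):
        a = rng.choice(A, p=pi[s])
        states.append(s)
        actions.append(a)
        # The entropy term adds a "logarithmic policy reward" -tau * log pi(a|s).
        soft_rewards.append(gamma**t * (R[s, a] - tau * np.log(pi[s, a])))
        s = rng.choice(S, p=P[s, a])
    # Discounted soft reward-to-go from each time step onward.
    togo = np.cumsum(soft_rewards[::-1])[::-1]
    grad = np.zeros_like(theta)
    for s_t, a_t, g_t in zip(states, actions, togo):
        # Score function of the soft-max policy: e_a - pi(.|s).
        score = -pi[s_t].copy()
        score[a_t] += 1.0
        grad[s_t] += g_t * score
    return grad

def two_phase_pg(P, R, rho, gamma=0.9, tau=0.1, horizon=30, step=0.5,
                 phase1_iters=50, phase1_batch=64,
                 phase2_iters=200, phase2_batch=4, seed=0):
    """Stochastic gradient ascent with a large batch first, then a small batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(R.shape)
    for iters, batch in ((phase1_iters, phase1_batch), (phase2_iters, phase2_batch)):
        for _ in range(iters):
            grads = [sample_gradient(theta, P, R, rho, gamma, tau, horizon, rng)
                     for _ in range(batch)]
            theta = theta + step * np.mean(grads, axis=0)
    return theta
```

Here `P` is assumed to be a transition tensor of shape `(S, A, S)`, `R` a reward table of shape `(S, A)`, and `rho` an initial state distribution. Because the log-policy reward is unbounded as an action probability approaches zero, individual samples can be large, which is the difficulty the abstract's variance result addresses.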

Implicit Policy for Reinforcement Learning

An Entropy Regularization Free Mechanism for Policy-based Reinforcement Learning

Promoting Stochasticity for Expressive Policies Via a Simple and Efficient Regularization Method

Implicit Two-Tower Policies

Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures

Inverse Reinforcement Learning with Explicit Policy Estimates

Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Increasing Entropy to Boost Policy Gradient Performance on Personalization Tasks

Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks

Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning

Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients

A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

Discretizing Continuous Action Space with Unimodal Probability Distributions for On-Policy Reinforcement Learning

Policy Gradient Algorithms Implicitly Optimize by Continuation

Fast Policy Learning for Linear Quadratic Control with Entropy Regularization

A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning

Policy Representation via Diffusion Probability Model for Reinforcement Learning

Representation-Driven Reinforcement Learning

Implicitly Regularized RL with Implicit Q-Values

Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization