Abstract: Entropy regularization is an efficient technique for encouraging exploration and preventing premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy-regularized RL algorithms has been limited. In this paper, we revisit the classical entropy-regularized policy gradient methods with the soft-max policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this scenario, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization, with one being an unbiased visitation measure-based estimator and the other one being a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general due to the additional logarithmic policy rewards introduced by the entropy term, the variances are uniformly bounded. We then propose a two-phase stochastic policy gradient (PG) algorithm that uses a large batch size in the first phase to overcome the challenge of the stochastic approximation due to the non-coercive landscape, and uses a small batch size in the second phase by leveraging the curvature information around the optimal policy. We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm. Our results are the first global convergence and sample complexity results for the stochastic entropy-regularized vanilla PG method.
What problem does this paper attempt to address?
The paper asks whether, in reinforcement learning (RL), entropy-regularized policy gradient (PG) methods still converge to the global optimum, and with low sample complexity, when only stochastic gradient estimates are available. In particular, it examines whether the favorable geometry that yields global convergence in the exact gradient setting can still be exploited in the stochastic gradient setting.
### Main Problems
1. **Global Optimal Convergence**: How to prove that the entropy-regularized policy gradient method converges to the globally optimal policy in the stochastic gradient setting.
2. **Sample Complexity**: How to keep the sample complexity of the entropy-regularized policy gradient method low in the stochastic gradient setting, with the specific goal of achieving \(\tilde{O}(1/\epsilon^2)\).
### Background
- **Entropy Regularization**: Entropy regularization is a commonly used technique for encouraging exploration and preventing premature convergence. It achieves this by adding an entropy term to the objective function.
- **Exact Gradient Setting**: With exact gradients, the entropy-regularized policy gradient method with the soft-max parametrization has been shown to converge linearly to the global optimum.
- **Stochastic Gradient Setting**: In practice, usually only a stochastic estimate of the policy gradient is available, not the exact gradient. It is therefore necessary to study how the entropy-regularized policy gradient method behaves in the stochastic gradient setting.
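For reference, the soft-max policy and the entropy-regularized objective mentioned above can be written in the standard discounted form below. This is a sketch in commonly used notation, not necessarily the paper's: the discount factor \(\gamma\), the regularization weight \(\tau\), and the initial state distribution \(\rho\) are symbols assumed here for illustration.

\[
\pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})},
\qquad
V_\tau^{\pi_\theta}(\rho) = \mathbb{E}_{s_0 \sim \rho,\; a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\sum_{t=0}^{\infty} \gamma^t \bigl(r(s_t, a_t) - \tau \log \pi_\theta(a_t \mid s_t)\bigr)\right]
\]

The term \(-\tau \log \pi_\theta(a_t \mid s_t)\) is the "logarithmic policy reward" referred to in the abstract. It grows without bound as \(\pi_\theta(a_t \mid s_t) \to 0\), which is why estimators built from it are unbounded in general.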
### Paper Contributions
1. **Proposing New Stochastic Gradient Estimators**:
- Proposed an unbiased, visitation measure-based stochastic gradient estimator.
- Proposed a nearly unbiased but more practical trajectory-based stochastic gradient estimator.
- Proved that although these estimators are unbounded in general, their variances are uniformly bounded.
2. **Two-Phase Stochastic Policy Gradient Algorithm**:
- In the first phase, a large batch size is used to overcome the difficulty that the non-coercive landscape poses for stochastic approximation.
- In the second phase, a small batch size is used, exploiting the curvature information around the optimal policy.
- Proved global optimal convergence and a sample complexity of \(\tilde{O}(1/\epsilon^2)\) for this algorithm.
3. **Theoretical Analysis**:
- Established global optimal convergence and a sample complexity bound for the entropy-regularized policy gradient method in the stochastic gradient setting.
- According to the authors, this is the first global convergence and sample complexity analysis of the vanilla entropy-regularized policy gradient method (such as REINFORCE and its variants) in the stochastic gradient setting.
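The contributions above can be illustrated with a minimal tabular sketch. It is not the paper's estimator or algorithm: it uses a generic REINFORCE-style, truncated-horizon reward-to-go estimator with the entropy-augmented reward `r - tau * log(pi)`, followed by a large-batch phase and a small-batch phase. All names (`softmax_policy`, `two_phase_pg`, `P`, `R`, `rho`, `tau`, `eta`, `phases`) and every numeric value are assumptions made for illustration; the paper's batch sizes, step sizes, and phase lengths are not reproduced here.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise soft-max: pi[s, a] is proportional to exp(theta[s, a])."""
    z = theta - theta.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def sample_trajectory(theta, P, R, rho, horizon, rng):
    """Roll out `horizon` steps; returns a list of (state, action, reward)."""
    pi = softmax_policy(theta)
    n_states, n_actions = theta.shape
    s = rng.choice(n_states, p=rho)
    traj = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj

def trajectory_gradient(theta, traj, gamma, tau):
    """Reward-to-go estimator with entropy-augmented rewards (illustrative).

    Truncating at a finite horizon introduces a small bias with respect to
    the infinite-horizon objective.
    """
    pi = softmax_policy(theta)
    grad = np.zeros_like(theta)
    reward_to_go = 0.0
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        # Discounted augmented reward: r - tau * log pi(a | s).
        reward_to_go += gamma**t * (r - tau * np.log(pi[s, a]))
        # Score of a tabular soft-max policy: one_hot(a) - pi(. | s).
        score = -pi[s].copy()
        score[a] += 1.0
        grad[s] += score * reward_to_go
    return grad

def batch_gradient(theta, P, R, rho, gamma, tau, horizon, batch_size, rng):
    """Average the single-trajectory estimator over a batch of rollouts."""
    grads = [
        trajectory_gradient(
            theta, sample_trajectory(theta, P, R, rho, horizon, rng), gamma, tau
        )
        for _ in range(batch_size)
    ]
    return np.mean(grads, axis=0)

def two_phase_pg(theta0, P, R, rho, gamma, tau, horizon, eta, phases, rng):
    """Gradient ascent; `phases` is a list of (num_iterations, batch_size)."""
    theta = theta0.copy()
    for num_iters, batch_size in phases:
        for _ in range(num_iters):
            theta += eta * batch_gradient(
                theta, P, R, rho, gamma, tau, horizon, batch_size, rng
            )
    return theta
```

A call such as `two_phase_pg(theta0, P, R, rho, 0.9, 0.1, 20, 0.1, [(50, 64), (200, 4)], np.random.default_rng(0))` would run a large-batch phase followed by a small-batch phase, with `P` of shape `(S, A, S)` and `R` of shape `(S, A)`. The schedule here is arbitrary and only mirrors the large-then-small structure described above.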
### Related Work
- **Exploration and Robustness**: Entropy regularization is known to improve exploration and robustness.
- **Other Methods**: Other algorithms, such as actor-critic, Q-learning, and trust-region policy optimization, also use entropy regularization.
- **Existing Results**: Existing theoretical results mainly cover the exact gradient setting; global optimal convergence and sample complexity in the stochastic gradient setting have received comparatively little study.
### Conclusion
By proposing new stochastic gradient estimators and a two-phase algorithm, the paper establishes global optimal convergence and a sample complexity of \(\tilde{O}(1/\epsilon^2)\) for the entropy-regularized policy gradient method in the stochastic gradient setting, filling a gap left by earlier exact-gradient analyses.