Abstract: In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at

What problem does this paper attempt to address?

The paper addresses how to understand and implement policy gradient algorithms in deep reinforcement learning. The authors give a comprehensive overview that covers both the theoretical basis of these algorithms and their practical use. Specifically, the paper addresses the following:

1. **Theoretical Basis**: The paper gives a detailed proof of the continuous version of the Policy Gradient Theorem, on which all policy gradient algorithms build. The proof gives readers a solid theoretical footing for understanding how these algorithms work.
2. **Algorithm Design and Implementation**: The paper compares and analyzes several policy gradient algorithms, including REINFORCE, A3C, TRPO, PPO, and V-MPO. The algorithms differ significantly in their design. By comparing these differences, the paper examines how individual design choices affect performance, and it provides pseudocode for each algorithm.
3. **Convergence Analysis**: The paper discusses convergence results from the existing literature, in particular by treating policy gradient algorithms as instances of Mirror Learning, and gives the corresponding convergence proof. This clarifies whether these algorithms are theoretically guaranteed to converge to the optimal solution.
4. **Numerical Experiments**: To assess practical performance, the paper runs numerical experiments that compare the algorithms on continuous control environments and provides insights into the benefits of regularization. The authors also release their implementations so that other researchers can reproduce the results or build on them.
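To make the phrase "design choices" in item 2 concrete, the sketch below contrasts two surrogate objectives in NumPy: the plain score-function objective used by REINFORCE-style methods and the clipped probability-ratio objective from the original PPO publication. It is an illustration under stated assumptions, not the paper's pseudocode: the function names, the batch layout (one log-probability and one advantage estimate per sampled action), and the default `clip_eps=0.2` are choices made here.

```python
import numpy as np

def reinforce_surrogate(log_probs, advantages):
    """Score-function surrogate: mean of log pi(a|s) * advantage.

    Differentiating this with respect to the policy parameters gives
    a sample estimate of the policy gradient.
    """
    return float(np.mean(log_probs * advantages))

def ppo_clip_surrogate(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio pi_new(a|s) / pi_old(a|s), computed from
    log-probabilities. clip_eps = 0.2 is an assumed default, not a value
    taken from the summarized paper.
    """
    ratio = np.exp(log_probs_new - log_probs_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

# Hypothetical batch of two sampled actions.
log_probs_old = np.log(np.array([0.25, 0.2]))
log_probs_new = np.log(np.array([0.5, 0.1]))
advantages = np.array([1.0, -1.0])

objective = ppo_clip_surrogate(log_probs_new, log_probs_old, advantages)
```

When the new and old policies coincide, the ratio is 1 everywhere and the clipped objective reduces to the mean advantage. The clipping only matters once the policy has moved away from the one that collected the data.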
In summary, this paper aims to provide a comprehensive and in-depth guide for researchers and practitioners in the field of deep reinforcement learning, helping them better understand, design, and implement algorithms based on policy gradients. In this way, the paper hopes to promote the development of this field and facilitate the emergence and application of more innovative algorithms.

### Related Formulas

- **Policy Gradient Theorem**:
  \[ \nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t|s_t)\,Q^\pi(s_t,a_t)\right] \]
  where $\tau$ represents a trajectory, $p_\theta(\tau)$ is the probability distribution of generating a trajectory under the policy $\pi_\theta$, $\nabla_\theta\log\pi_\theta(a_t|s_t)$ is the log-likelihood gradient of the policy, and $Q^\pi(s_t,a_t)$ is the action-value function.
- **Update Rule**:
  \[ \theta_{\text{new}}\leftarrow\theta+\alpha\nabla_\theta J(\theta) \]
  where $\alpha$ is the step-size parameter.
- **Value Function**:
  \[ V^\pi(s)=\mathbb{E}_\pi\left[G_t\mid S_t=s\right] \]
  \[ Q^\pi(s,a)=\mathbb{E}_\pi\left[G_t\mid S_t=s,A_t=a\right] \]
- **Advantage Function**:
  \[ A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s) \]

These formulas are used in the paper to explain and derive the core concepts and working principles of policy gradient algorithms.
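The formulas above can be combined into a small single-trajectory gradient estimate. The following NumPy sketch is a minimal illustration, not the paper's implementation. It assumes a tabular softmax policy with parameters `theta` of shape `(num_states, num_actions)`, uses the sampled return $G_t$ as a Monte Carlo stand-in for $Q^\pi(s_t,a_t)$, and optionally subtracts a per-state baseline to obtain an advantage-style weight. All function names, the toy trajectory, and the step size are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def returns_to_go(rewards, gamma=1.0):
    """Return G_t for every step t of one trajectory.

    gamma is an assumed discount factor; gamma = 1.0 gives the plain
    sum of future rewards.
    """
    out = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

def grad_log_policy(theta, s, a):
    """Gradient of log pi_theta(a|s) for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])
    grad[s, a] += 1.0
    return grad

def policy_gradient_estimate(theta, states, actions, rewards,
                             gamma=1.0, baseline=None):
    """Single-trajectory estimate of the policy gradient.

    Each score-function term is weighted by G_t, or by G_t - baseline[s_t]
    when a per-state baseline (an estimate of V) is supplied.
    """
    returns = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        weight = g if baseline is None else g - baseline[s]
        grad += weight * grad_log_policy(theta, s, a)
    return grad

# Hypothetical toy trajectory in a 2-state, 2-action problem.
theta = np.zeros((2, 2))
states, actions, rewards = [0, 1, 0], [1, 0, 1], [1.0, 0.0, 1.0]

alpha = 0.1  # step size, chosen arbitrarily for the illustration
grad = policy_gradient_estimate(theta, states, actions, rewards)
theta_new = theta + alpha * grad  # gradient ascent step from the update rule
```

In practice the estimate is averaged over many sampled trajectories, and the baseline is typically a learned value function. A single trajectory is used here only to keep the arithmetic easy to follow.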

The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations

A Closer Look at Deep Policy Gradients

Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

Deep deterministic policy gradient algorithm: A systematic review

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Behind the Myth of Exploration in Policy Gradients

Learning Optimal Deterministic Policies with Stochastic Policy Gradients

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Policy Gradient for Reinforcement Learning with General Utilities

Policy Gradient Algorithms Implicitly Optimize by Continuation

ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control

The Reinforce Policy Gradient Algorithm Revisited

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control

Deterministic Value-Policy Gradients

A Large Deviations Perspective on Policy Gradient Algorithms

Computational Performance of Deep Reinforcement Learning to find Nash Equilibria

Deterministic Policy Gradients with General State Transitions

Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning

Identifying Policy Gradient Subspaces

Comparing Deep Reinforcement Learning and Evolutionary Methods in Continuous Control