Abstract: In this article, we propose a distributional policy-gradient method that combines distributional reinforcement learning (RL) with policy gradient. Conventional RL algorithms typically estimate the expectation of the return given a state-action pair. In contrast, distributional RL algorithms treat the return as a random variable and estimate its distribution, which characterizes the probability of the different returns that result from environmental uncertainty. The return distribution therefore carries more information than its expectation, which generally leads to better policies. Although distributional RL has been widely investigated in value-based RL methods, very few policy-gradient methods take advantage of it. To bridge this research gap, we propose a distributional policy-gradient method that introduces a distributional value function into the policy gradient (DVDPG). We estimate the distribution of the policy gradient instead of the expectation estimated in conventional policy-gradient RL methods. Furthermore, we propose two mechanisms for sampling the policy-gradient value used in policy improvement. The first is a distribution-probability-sampling method, which samples the policy-gradient value according to the quantile probabilities of the return distribution. The second is a uniform sampling mechanism. With these sampling mechanisms, the proposed method increases the stochasticity of the policy gradient, which improves exploration efficiency and helps the policy avoid local optima. In sparse-reward tasks, the distribution-probability-sampling method outperforms the uniform sampling mechanism; in dense-reward tasks, the two perform similarly. Moreover, we show that the conventional policy-gradient method is a special case of the proposed method.
Experimental results on various sparse-reward and dense-reward OpenAI Gym tasks illustrate the efficiency of the proposed method, which outperforms the baselines in almost all environments.
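The two sampling mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the function names `sample_value` and `expected_value`, the representation of the return distribution as a list of quantile values with associated probabilities, and the example numbers are all assumptions made for illustration. The sketch only shows the general idea of drawing one value from the distribution (by probability or uniformly) instead of using its mean, and that taking the probability-weighted mean recovers the usual expected-value estimate.

```python
import random

def sample_value(quantile_values, probs=None, rng=random):
    """Draw one value from a discretized return distribution.

    Hypothetical helper: with `probs` given, values are drawn in proportion
    to their probabilities (a stand-in for distribution-probability sampling);
    with `probs=None`, every value is equally likely (uniform sampling).
    """
    if probs is None:
        return rng.choice(quantile_values)
    return rng.choices(quantile_values, weights=probs, k=1)[0]

def expected_value(quantile_values, probs):
    """Probability-weighted mean, i.e. the conventional scalar estimate."""
    total = sum(probs)
    return sum(q * p for q, p in zip(quantile_values, probs)) / total

# Illustrative numbers only (assumed, not taken from the paper).
returns = [-1.0, 0.0, 2.0, 10.0]
probabilities = [0.1, 0.4, 0.4, 0.1]

rng = random.Random(0)
by_probability = sample_value(returns, probabilities, rng=rng)
uniform = sample_value(returns, rng=rng)
mean_estimate = expected_value(returns, probabilities)
```

Under these assumptions, a learner that always used `mean_estimate` would behave like a conventional expectation-based method, while one that used a sampled value would see a noisier signal on each update, which is the source of the extra exploration the abstract describes.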

PG-Rainbow: Using Distributional Reinforcement Learning in Policy Gradient Methods

Distributional Policy Gradient with Distributional Value Function

Bayesian Distributional Policy Gradients

Sample-based Distributional Policy Gradient

Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Distributional Reinforcement Learning With Quantile Regression

Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning

Implicit Quantile Networks for Distributional Reinforcement Learning

On Policy Evaluation Algorithms in Distributional Reinforcement Learning

Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research

PACER: A Fully Push-forward-based Distributional Reinforcement Learning Algorithm

Regularly Updated Deterministic Policy Gradient Algorithm

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Stochastic Cubic-Regularized Policy Gradient Method

Learn to Interpret Atari Agents

Fully parameterized quantile function for distributional reinforcement learning

SAPG: Split and Aggregate Policy Gradients

Normality-Guided Distributional Reinforcement Learning for Continuous Control

Bag of Policies for Distributional Deep Exploration

A Distributional Perspective on Reinforcement Learning