Abstract: We study a reinforcement learning setting in which the state transition function is a convex combination of a stochastic continuous function and a deterministic function. This setting generalizes the widely studied stochastic state transition setting, namely the setting of the deterministic policy gradient (DPG). We first give a simple example to illustrate that the deterministic policy gradient may be infinite under deterministic state transitions, and introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee its existence for all discount factors. We then derive a closed form of the policy gradient whenever it exists. Furthermore, to overcome the high sample complexity of DPG in this setting, we propose the Generalized Deterministic Policy Gradient (GDPG) algorithm. The main innovation of the algorithm is a new method of applying model-based techniques to a model-free algorithm, the deep deterministic policy gradient algorithm (DDPG). GDPG optimizes the long-term rewards of the model-based augmented MDP subject to a constraint that the long-term rewards of this MDP are less than those of the original one. Finally, we conduct extensive experiments comparing GDPG with state-of-the-art methods and with the direct model-based extension of DDPG on several standard continuous control benchmarks. The results demonstrate that GDPG substantially outperforms DDPG, the model-based extension of DDPG, and the other baselines in terms of both convergence and long-term rewards in most environments.

Policy Gradient Fuzzy Reinforcement Learning

Safe Reinforcement Learning Using Finite-Horizon Gradient-based Estimation

Convergence Rate of Primal-Dual Approach to Constrained Reinforcement Learning with Softmax Policy

Stochastic Cubic-Regularized Policy Gradient Method

Policy Gradient Reinforcement Learning for Policy Represented by Fuzzy Rules: Application to Simulations of Speed Control of an Automobile

Mixed Policy Gradient: off-policy reinforcement learning driven jointly by data and model

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Model Gradient: Unified Model and Policy Learning in Model-Based Reinforcement Learning

Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Learning Optimal Deterministic Policies with Stochastic Policy Gradients

A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential

vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement

Fractal Landscapes in Policy Optimization

A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning

A Genetic Fuzzy System for Interpretable and Parsimonious Reinforcement Learning Policies

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

The $f$-Divergence Reinforcement Learning Framework

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

Model-free Policy Learning with Reward Gradients

Deterministic Policy Gradients with General State Transitions

Policy Gradient Method For Robust Reinforcement Learning