Abstract: The policy gradient theorem states that the policy should only be updated in states that are visited by the current policy, which leads to insufficient planning in the off-policy states, and thus to convergence to suboptimal policies. We tackle this planning issue by extending the policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, and thereby solve the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll & Mr Hyde (JH), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. JH's independent policies make it possible to record two separate replay buffers, one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update JH's models with a mixture of on-policy and off-policy updates. More than an algorithm, JH defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We test extensively on finite MDPs, where JH demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem, where it shows promising results.
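The two-buffer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TwoBufferReplay`, the `explored` flag, and the `on_fraction` mixing weight are all hypothetical names chosen for illustration, and the abstract does not specify how the mixture is weighted.

```python
import random
from collections import deque


class TwoBufferReplay:
    """Keeps on-policy (Jekyll) and off-policy (Hyde) transitions apart."""

    def __init__(self, capacity=10_000, seed=0):
        self.jekyll = deque(maxlen=capacity)  # collected while exploiting
        self.hyde = deque(maxlen=capacity)    # collected while exploring
        self.rng = random.Random(seed)

    def add(self, transition, explored):
        # Route each transition by which personality generated it.
        (self.hyde if explored else self.jekyll).append(transition)

    def sample(self, batch_size, on_fraction=0.5):
        # on_fraction is an assumed mixing weight, not a value from the paper.
        n_on = min(round(batch_size * on_fraction), len(self.jekyll))
        n_off = min(batch_size - n_on, len(self.hyde))
        batch = self.rng.sample(list(self.jekyll), n_on)
        batch += self.rng.sample(list(self.hyde), n_off)
        return batch
```

A training loop would call `add` after every environment step and draw each update batch with `sample`, so that the models see both on-policy and off-policy states in a controlled proportion.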

On- and Off-Policy Monotonic Policy Improvement

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Easy Monotonic Policy Iteration

An Analytical Update Rule for General Policy Optimization

Off-Policy Policy Gradient with State Distribution Correction

Monotonic Robust Policy Optimization with Model Discrepancy

On-Policy Trust Region Policy Optimisation with Replay Buffers

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Absolute Policy Optimization

On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Variance-Reduced Off-Policy Memory-Efficient Policy Search

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates

Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning

On the Reuse Bias in Off-Policy Reinforcement Learning

Separated Trust Regions Policy Optimization Method

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Policy Gradient Algorithms Implicitly Optimize by Continuation