Abstract: In this paper, we study a few challenging theoretical and numerical issues in the well-known trust region policy optimization method for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to that policy. The trust region subproblem is constructed from a surrogate function consistent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme, which ensures that each step improves the model function and stays within the trust region. To overcome the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for a Gaussian policy, which is commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game-playing tasks from OpenAI Gym.
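The ratio test described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function name `ratio_test`, the thresholds `eta_accept` and `eta_expand`, the factors `shrink` and `expand`, and the choice to estimate the standard deviation from returns sampled under the current policy. The sketch also assumes `predicted_increase` is positive, which the line search on the model function is meant to ensure.

```python
import statistics


def ratio_test(returns_old, returns_new, predicted_increase, radius,
               eta_accept=0.1, eta_expand=0.75, shrink=0.5, expand=2.0):
    """Hypothetical acceptance test for a stochastic trust region step.

    returns_old / returns_new: sampled total rewards under the current
    policy and the trial policy (assumed inputs, at least two samples each).
    predicted_increase: increase of the surrogate (model) function,
    assumed positive.
    """
    # Estimated actual increase of the total expected reward.
    actual_increase = statistics.fmean(returns_new) - statistics.fmean(returns_old)

    # Empirical standard deviation of the sampled returns (assumption:
    # taken over the current policy's samples).
    sigma = statistics.stdev(returns_old)

    # The standard deviation is added to the predicted increase in the
    # denominator, so noisy estimates make the ratio more conservative.
    ratio = actual_increase / (predicted_increase + sigma)

    accepted = ratio >= eta_accept
    if ratio >= eta_expand:
        radius *= expand      # model agrees well: enlarge the trust region
    elif not accepted:
        radius *= shrink      # poor agreement: reject the step and shrink
    return accepted, radius, ratio
```

Under these assumptions, a trial policy is kept only when the measured improvement is a sufficient fraction of the noise-adjusted prediction, and the radius update reuses the same ratio.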

An Analytical Update Rule for General Policy Optimization

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Absolute Policy Optimization

Trust Region Policy Optimization

Separated Trust Regions Policy Optimization Method

Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates

A Stochastic Trust-Region Framework for Policy Optimization

On- and Off-Policy Monotonic Policy Improvement

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Policy Optimization over General State and Action Spaces

Particle Based Stochastic Policy Optimization

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach

Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

Order Matters: Agent-by-agent Policy Optimization

Optimistic Multi-Agent Policy Gradient

Acceleration in Policy Optimization

Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

On-Policy Trust Region Policy Optimisation with Replay Buffers