Abstract: In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization method for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to that policy. The trust region subproblem is constructed from a surrogate function consistent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme, which ensures that each step increases the model function and stays within the trust region. To counter the bias that sampling introduces into the function estimates in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for a Gaussian policy, which is commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and that global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game-playing tasks from OpenAI Gym.
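
The acceptance test and radius update described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `trust_region_step`, the thresholds, the shrink and expand factors, and the radius cap are all assumptions chosen for illustration. The only element taken from the abstract is that the empirical standard deviation of the reward is added to the predicted increase in the ratio.

```python
def trust_region_step(actual_increase, predicted_increase, reward_std, radius,
                      accept_threshold=0.1, expand_threshold=0.75,
                      shrink=0.5, expand=2.0, max_radius=1.0):
    """Decide whether to accept a trial policy and update the radius.

    All thresholds and factors are illustrative assumptions, not values
    taken from the paper.
    """
    # Predicted increase of the surrogate, padded with the empirical
    # standard deviation of the estimated total reward.
    denom = predicted_increase + reward_std
    if denom <= 0.0:
        # The model predicts no gain: reject and shrink (assumed handling).
        return False, radius * shrink

    ratio = actual_increase / denom
    accepted = ratio >= accept_threshold

    if ratio >= expand_threshold:
        # The model agrees well with the observed gain: enlarge the region.
        new_radius = min(radius * expand, max_radius)
    elif accepted:
        # Acceptable agreement: keep the radius.
        new_radius = radius
    else:
        # Poor agreement: reject the trial point and shrink the region.
        new_radius = radius * shrink
    return accepted, new_radius


if __name__ == "__main__":
    print(trust_region_step(1.0, 1.0, 0.25, 0.1))
    print(trust_region_step(0.0, 1.0, 0.5, 0.1))
```

In this sketch, a larger standard deviation lowers the ratio for the same observed gain, so noisy reward estimates make the test more conservative about accepting a trial point or enlarging the radius.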

Extreme Trust Region Policy Optimization for Active Object Recognition

An Off-Policy Trust Region Policy Optimization Method with Monotonic Improvement Guarantee for Deep Reinforcement Learning

Active Object Recognition Using Hierarchical Local-Receptive-field-based Extreme Learning Machine

Active Visual Perception Enhancement Method Based on Deep Reinforcement Learning

A Stochastic Trust-Region Framework for Policy Optimization

Trust Region-Guided Proximal Policy Optimization

Active Object Perceiver: Recognition-Guided Policy Learning for Object Searching on Mobile Robots

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Supported Trust Region Optimization for Offline Reinforcement Learning

Active 6D Multi-Object Pose Estimation in Cluttered Scenarios with Deep Reinforcement Learning

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach

In-Hand Manipulation For Active Object Recognition

Extreme Low-Resolution Action Recognition with Confident Spatial-Temporal Attention Transfer

Learning to Constrain Policy Optimization with Virtual Trust Region

Model-Ensemble Trust-Region Policy Optimization

Learning to View: Decision Transformers for Active Object Detection

End-to-end Active Object Tracking Via Reinforcement Learning

End-to-End Active Object Tracking and Its Real-World Deployment Via Reinforcement Learning

DTCM: Joint Optimization of Dark Enhancement and Action Recognition in Videos

Early Action Recognition With Category Exclusion Using Policy-Based Reinforcement Learning

Deep Active Contours for Real-time 6-Dof Object Tracking