Abstract: Policy gradient algorithms for reinforcement learning (RL) have successfully tackled a broad range of high-dimensional continuous RL problems, including many challenging robotic control problems. These algorithms fall largely into two categories: on-policy and off-policy. Off-policy deep RL (DRL) algorithms are more sample-efficient than on-policy algorithms and often outperform them. However, cutting-edge off-policy algorithms still suffer from low-quality estimates of policy gradients, which compromises learning performance and makes them highly sensitive to hyper-parameter settings. To address this issue, we propose a new concept of robust policy gradient (RPG). Driven by RPG, and inspired by the recent success of several ensemble DRL algorithms, this paper develops a new policy ensemble gradient (PEG) algorithm for DRL. PEG estimates RPG efficiently and effectively by combining multiple policy gradients, each obtained from one of several off-policy base learners in an ensemble. The estimated RPG is then used to train all base learners simultaneously. Comprehensive experiments were performed on six MuJoCo benchmark problems. Compared to four state-of-the-art off-policy algorithms and four cutting-edge ensemble policy gradient algorithms, our new PEG algorithm achieved highly competitive stability, performance, and sample efficiency. Further analysis shows that PEG is insensitive to varied hyper-parameter settings, confirming the positive role of RPG in building reliable and effective off-policy DRL algorithms.

Overcoming Delayed Feedback in Reinforcement Learning Using Actor Ensembles

Actor-Critic Reinforcement Learning with Phased Actor

Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning

Overcoming Delayed Feedback Via Overlook Decision Making

DEER: A Delay-Resilient Framework for Reinforcement Learning with Variable Delays

TD3 with Composite Forgetting Prioritized Experience Replay

Boosting Reinforcement Learning with Strongly Delayed Feedback Through Auxiliary Short Delays

Actor Prioritized Experience Replay

Double Actor-Critic with TD Error-Driven Regularization in Reinforcement Learning

Optimizing TD3 for 7-DOF Robotic Arm Grasping: Overcoming Suboptimality with Exploration-Enhanced Contrastive Learning

Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control

Combining Reinforcement Learning and Tensor Networks, with an Application to Dynamical Large Deviations

Efficient Continuous Control with Double Actors and Regularized Critics

Robust Adaptive Ensemble Adversary Reinforcement Learning

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Addressing Delays in Reinforcement Learning Via Delayed Adversarial Imitation Learning

Demonstration actor critic

A TD3-based multi-agent deep reinforcement learning method in mixed cooperation-competition environment

Off-Policy Reinforcement Learning with Delayed Rewards