Abstract: Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) in high-dimensional action and state spaces. Recently, a promising approach has been proposed that combines exploration in the action space with exploration in the parameter space to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines an evolutionary algorithm (EA), which explores directly in the parameter space in a gradient-free manner, with the actor-critic deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in the action space in a gradient-based manner, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parameter perturbation part) evolve in a guided manner by using the gradient information provided by DDPG, while the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve sample efficiency. In particular, we propose a criterion for determining the number of training steps required for DDPG, which ensures that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
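The loop described in the abstract, in which an EA population is evaluated, its best individual is fine-tuned by a gradient-based learner, and the result is fed back into the population, can be illustrated with a minimal toy sketch. This is not the authors' algorithm: the quadratic `fitness` function stands in for episodic return, plain gradient ascent in `fine_tune` stands in for DDPG, and the fixed `grad_steps` parameter replaces the paper's criterion for choosing the number of training steps, which the abstract does not specify. All function and parameter names are hypothetical.

```python
import random

TARGET = (1.0, -2.0)  # hypothetical optimum of the toy objective


def fitness(theta):
    """Toy stand-in for episodic return; maximal (zero) at TARGET."""
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


def fitness_grad(theta):
    """Analytic gradient of the toy fitness with respect to theta."""
    return [-2.0 * (t - g) for t, g in zip(theta, TARGET)]


def fine_tune(theta, steps, lr=0.1):
    """Gradient-ascent stand-in for the policy-gradient (DDPG) part."""
    theta = list(theta)
    for _ in range(steps):
        grad = fitness_grad(theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def evolve(pop_size=8, generations=20, grad_steps=5, sigma=0.3, seed=0):
    """Alternate gradient-free perturbation with gradient-based fine-tuning."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # EA part: rank individuals by fitness (gradient-free evaluation).
        population.sort(key=fitness, reverse=True)
        best = population[0]
        # Gradient part: fine-tune only the best individual.
        tuned = fine_tune(best, grad_steps)
        # Close the loop: keep both elites and perturb the tuned individual
        # in parameter space to form the rest of the next generation.
        elites = [best, tuned]
        children = [
            [t + rng.gauss(0.0, sigma) for t in tuned]
            for _ in range(pop_size - len(elites))
        ]
        population = elites + children
    return max(population, key=fitness)
```

In this sketch the fine-tuned individual seeds the next generation's perturbations, which is one simple way to let gradient information guide the population; the actual mechanism, and the switching between a previous and a newly generated RL-Actor, are described only in the full paper.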

Efficient sample reuse in policy gradients with parameter-based exploration

Generalize Robot Learning from Demonstration to Variant Scenarios with Evolutionary Policy Gradient

Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Optimal Control-Based Baseline for Guided Exploration in Policy Gradient Methods

Model-free Policy Learning with Reward Gradients

Generalizable Policy Improvement Via Reinforcement Sampling (Student Abstract)

Behind the Myth of Exploration in Policy Gradients

Hessian Aided Policy Gradient

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Policy Gradient with Active Importance Sampling

A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization

Policy Mirror Descent Inherently Explores Action Space

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

Careful at Estimation and Bold at Exploration

Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate

Stochastic Cubic-Regularized Policy Gradient Method

Identifying Policy Gradient Subspaces

Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline