Abstract: Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a promising approach has been proposed that combines exploration in the action space with exploration in the parameter space to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines the evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with the deep deterministic policy gradient (DDPG), an actor-critic reinforcement learning algorithm that explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) evolve in a guided manner by utilizing the gradient information provided by the DDPG, while the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve sample efficiency. In particular, we propose a criterion for determining the number of training steps required by the DDPG, which ensures that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
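The generation loop described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: `fitness` is a hypothetical stand-in for an episode return, `fitness_grad` is a hypothetical stand-in for the gradient information a DDPG critic would provide, and the fixed `finetune_steps` is an assumption that replaces the paper's criterion for choosing the number of training steps, which the abstract does not specify. The names `run_hybrid`, `sigma`, and `lr` are likewise illustrative.

```python
import random

def fitness(theta):
    # Toy stand-in for an episode return: higher is better, peak at theta = 1.
    return -sum((x - 1.0) ** 2 for x in theta)

def fitness_grad(theta):
    # Analytic gradient of the toy fitness; assumed stand-in for the
    # gradient a DDPG critic would supply to the actor.
    return [-2.0 * (x - 1.0) for x in theta]

def run_hybrid(pop_size=10, dim=5, generations=30,
               finetune_steps=5, lr=0.05, sigma=0.1, seed=0):
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Rank the population, best individual first.
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))

        # Gradient-based part: fine-tune a copy of the best individual only.
        tuned = list(population[0])
        for _ in range(finetune_steps):
            grad = fitness_grad(tuned)
            tuned = [x + lr * g for x, g in zip(tuned, grad)]

        # Gradient-free part: keep the top half unchanged and refill the
        # remaining slots with Gaussian-mutated copies of random elites.
        n_elite = pop_size // 2
        elites = population[:n_elite]
        children = []
        for _ in range(pop_size - n_elite - 1):
            parent = rng.choice(elites)
            children.append([x + rng.gauss(0.0, sigma) for x in parent])

        # Close the loop: the fine-tuned individual rejoins the population.
        population = elites + [tuned] + children
    return max(population, key=fitness), best_history
```

Calling `best, history = run_hybrid()` returns the best parameter vector found and the best fitness recorded at the start of each generation. Because the elites are carried over unchanged, the recorded best fitness never decreases in this toy setting; a real DRL setup with noisy episode returns would not give that guarantee.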

Dynamic Policy Programming with Descending Regularization for Efficient Reinforcement Learning Control

The Ladder in Chaos: A Simple and Effective Improvement to General DRL Algorithms by Policy Path Trimming and Boosting

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Regularly Updated Deterministic Policy Gradient Algorithm

An Active Exploration Method for Data Efficient Reinforcement Learning

Improve PID Controller Through Reinforcement Learning

Deep Reinforcement Learning Using Least‐squares Truncated Temporal‐difference

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Approximate Policy-Based Accelerated Deep Reinforcement Learning

Deep Reinforcement Learning with Robust Deep Deterministic Policy Gradient

Reward-Adaptive Reinforcement Learning: Dynamic Policy Gradient Optimization for Bipedal Locomotion

Solving Reach-Avoid-Stay Problems Using Deep Deterministic Policy Gradients

Safe Deep Policy Adaptation

Model Free Deep Deterministic Policy Gradient Controller for Setpoint Tracking of Non-minimum Phase Systems

Asynchronous Episodic Deep Deterministic Policy Gradient: Toward Continuous Control in Computationally Complex Environments

Actor-Critic Reinforcement Learning with Phased Actor

Data-Efficient Reinforcement Learning Using Active Exploration Method

Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

Deterministic policy gradient based optimal control with probabilistic constraints

Deterministic Value-Policy Gradients

Combing Policy Evaluation and Policy Improvement in a Unified F-Divergence Framework