Abstract: Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a promising approach has been proposed that combines exploration in the action space with exploration in the parameter space to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines the evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with the deep deterministic policy gradient (DDPG) algorithm, an actor-critic reinforcement learning method that explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) evolve in a guided manner by utilizing the gradient information provided by the DDPG, and the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve sample efficiency. In particular, we propose a criterion for determining the number of training steps required by the DDPG, which ensures that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
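
The interplay described in the abstract, where a gradient-free population search is guided by a gradient-based learner that fine-tunes only the best individual, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy quadratic objective in place of episode returns, an analytic gradient in place of DDPG, and a fixed number of gradient steps (`grad_steps`) in place of the proposed step-count criterion, which the abstract does not specify. All names (`TARGET`, `fitness`, `fine_tune`, `mutate`, `run`) are hypothetical.

```python
import random

TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum of the toy objective


def fitness(theta):
    """Toy stand-in for an episode return: higher is better, maximum 0 at TARGET."""
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


def gradient(theta):
    """Analytic gradient of the toy fitness (stand-in for a policy gradient)."""
    return [-2.0 * (t - g) for t, g in zip(theta, TARGET)]


def fine_tune(theta, steps, lr=0.1):
    """Gradient-based part: a few ascent steps applied to a single individual."""
    theta = list(theta)
    for _ in range(steps):
        grad = gradient(theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


def mutate(theta, rng, sigma=0.1):
    """Gradient-free part: Gaussian perturbation in parameter space."""
    return [t + rng.gauss(0.0, sigma) for t in theta]


def run(generations=30, pop_size=10, n_elite=3, grad_steps=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-3.0, 3.0) for _ in TARGET] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Rank the population; the best individual comes first.
        population.sort(key=fitness, reverse=True)
        # The gradient part fine-tunes only the current best individual.
        tuned = fine_tune(population[0], grad_steps)
        # Elites survive unchanged; the tuned individual is injected back,
        # so later mutations are guided by the gradient information.
        parents = population[:n_elite] + [tuned]
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
        history.append(max(fitness(p) for p in population))
    return history
```

Calling `run()` returns the best fitness per generation. Because elites are kept unchanged, that sequence never decreases in this toy setting; nothing here says anything about the stability or sample efficiency reported for the actual algorithm.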

Policy iteration for parameterized Markov decision processes and its application

Parameterized Markov Decision Process and Its Application to Service Rate Control

Simulation Optimization Algorithm for SMDPs with Parameterized Randomized Stationary Policies

Policy Iteration Based Feedback Control

Approximate Policy Iteration for Robust Stochastic Control of Multi-agent Markov Decision Processes

From Optimization to Control: Quasi Policy Iteration

Online Markov decision processes with policy iteration

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

Online policy iteration algorithm for semi-Markov switching state-space control processes

Policy Search for the Optimal Control of Markov Decision Processes: A Novel Particle-Based Iterative Scheme

A Simulation Optimization Algorithm for CTMDPs Based on Randomized Stationary Policies

Relative Policy-Transition Optimization for Fast Policy Transfer

Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes

Neural Network Approaches for Parameterized Optimal Control

A policy iteration algorithm for non-Markovian control problems

Policy iteration for customer-average performance optimization of closed queueing systems

Efficient Policy Iteration for Robust Markov Decision Processes via Regularization

Approximate Policy Iteration Schemes: A Comparison

Least squares policy iteration with instrumental variables vs. direct policy search: comparison against optimal benchmarks using energy storage

Policy gradient methods for discrete time linear quadratic regulator with random parameters