Abstract: Efficient and stable exploration remains a key challenge for deep reinforcement learning (DRL) operating in high-dimensional action and state spaces. Recently, a promising approach has been proposed that combines exploration in the action space with exploration in the parameter space to get the best of both methods. In this article, we propose a new iterative, closed-loop framework that combines the evolutionary algorithm (EA), which explores in a gradient-free manner directly in the parameter space, with an actor-critic method, the deep deterministic policy gradient (DDPG) reinforcement learning algorithm, which explores in a gradient-based manner in the action space, so that the two methods cooperate in a more balanced and efficient way. In our framework, the policies represented by the EA population (the parametric perturbation part) evolve in a guided manner by using the gradient information provided by the DDPG, while the policy gradient part (DDPG) is used only as a fine-tuning tool for the best individual in the EA population to improve sample efficiency. In particular, we propose a criterion for determining the number of training steps required for the DDPG, which ensures that useful gradient information can be generated from the EA-generated samples and that the DDPG and EA parts work together in a more balanced way during each generation. Furthermore, within the DDPG part, our algorithm can flexibly switch between fine-tuning the same previous RL-Actor and fine-tuning a new one generated by the EA, depending on the situation, to further improve efficiency. Experiments on a range of challenging continuous control benchmarks demonstrate that our algorithm outperforms related works and offers a satisfactory trade-off between stability and sample efficiency.
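
The closed loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the quadratic `fitness` stands in for an episode return, the analytic `gradient` stands in for the DDPG policy gradient, and the fixed `tune_steps` is a placeholder for the paper's training-step criterion, which the abstract does not specify. All function names and hyperparameters here are hypothetical.

```python
import random

def fitness(theta):
    # Toy stand-in for an episode return; higher is better, peak at (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

def gradient(theta):
    # Analytic gradient of the toy fitness (assumed stand-in for a policy gradient).
    return [-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)]

def fine_tune(theta, steps, lr=0.05):
    # Gradient-based part: a few ascent steps applied to one individual only.
    theta = list(theta)
    for _ in range(steps):
        g = gradient(theta)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
    return theta

def mutate(theta, rng, sigma=0.1):
    # Gradient-free part: Gaussian perturbation directly in parameter space.
    return [t + rng.gauss(0.0, sigma) for t in theta]

def run(generations=30, pop_size=10, tune_steps=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best = ranked[0]
        # Fine-tune only the best individual, then feed it back into the population.
        tuned = fine_tune(best, tune_steps)
        elites = ranked[: pop_size // 2]
        n_children = pop_size - len(elites) - 1
        children = [mutate(rng.choice(elites), rng) for _ in range(n_children)]
        population = elites + [tuned] + children
        history.append(max(fitness(p) for p in population))
    return population, history

if __name__ == "__main__":
    _, history = run()
    print("best fitness, first and last generation:", history[0], history[-1])
```

Because the elites are carried over unchanged, the best fitness in this sketch never decreases between generations; the fine-tuned individual is what pulls the population toward the optimum faster than mutation alone.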

Exploring Policy Diversity in Parallel Actor-Critic Learning

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

PP-PG: Combining Parameter Perturbation with Policy Gradient Methods for Effective and Efficient Explorations in Deep Reinforcement Learning

A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Dueling Network Architecture for Multi-Agent Deep Deterministic Policy Gradient

Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

Progressive Diversifying Policy for Multi-Agent Reinforcement Learning

Actor-Critic Reinforcement Learning with Phased Actor

Phasic Parallel-Network Policy: a Deep Reinforcement Learning Framework Based on Action Correlation

Efficiently Training On-Policy Actor-Critic Networks in Robotic Deep Reinforcement Learning with Demonstration-like Sampled Exploration

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

Diverse Exploration for Fast and Safe Policy Improvement

Continual Reinforcement Learning with Diversity Exploration and Adversarial Self-Correction

Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics

Iteratively Learning Novel Strategies with Diversity Measured in State Distances

Learning Diverse Policies with Soft Self-Generated Guidance

Explorer-Actor-Critic: Better Actors for Deep Reinforcement Learning

Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model

PDRL: Towards Deeper States and Further Behaviors in Unsupervised Skill Discovery by Progressive Diversity

Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning