Abstract: Despite the recent success of Deep Reinforcement Learning (DRL) in self-driving cars, robotics and surveillance, training DRL agents takes a tremendous amount of time and computational resources. In this article, we aim to accelerate DRL with a Prioritized Replay Buffer, chosen for its state-of-the-art performance on various benchmarks. The computation primitives of DRL with a Prioritized Replay Buffer are environment emulation, neural network inference, sampling from the Prioritized Replay Buffer, updating the Prioritized Replay Buffer and neural network training. The speed of these primitives varies across DRL algorithms such as Deep Q Network and Deep Deterministic Policy Gradient, which makes any fixed mapping of DRL algorithms onto hardware inefficient. In this work, we propose a framework for mapping DRL algorithms onto heterogeneous platforms consisting of a multi-core CPU, a GPU and an FPGA. First, we develop specific accelerators for each primitive on the CPU, FPGA and GPU. Second, we relax the data dependency between priority update and sampling in the Prioritized Replay Buffer. By doing so, the latency caused by data transfer between the GPU, FPGA and CPU can be completely hidden without sacrificing the rewards achieved by agents trained with the target DRL algorithms. Finally, given a DRL algorithm specification, our design space exploration automatically chooses the optimal mapping of the primitives based on an analytical performance model. On widely used benchmark environments, our experimental results demonstrate up to 997.3× improvement in training throughput compared with baseline mappings on the same heterogeneous platform. Compared with the state-of-the-art distributed Reinforcement Learning framework RLlib, we achieve 1.06× to 1005× improvement in training throughput.
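To make two of the primitives named above concrete (sampling from and updating a Prioritized Replay Buffer), the following is a minimal Python sketch of proportional prioritized sampling on a binary sum tree. It is an illustration under stated assumptions, not the accelerator design described in the abstract: the class name `SumTreeReplay`, its methods and the default priority of 1.0 are all hypothetical, and the sketch runs on a single CPU thread with no relaxation of the update/sample dependency.

```python
import random


class SumTreeReplay:
    """Proportional prioritized replay on an array-backed sum tree.

    Illustrative only: names and defaults are assumptions, not the paper's API.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # Node 1 is the root; leaves occupy indices [capacity, 2 * capacity).
        self.tree = [0.0] * (2 * capacity)
        self.data = [None] * capacity
        self.next_slot = 0

    def update(self, slot, priority):
        """Set the priority of one slot and refresh the sums on its root path."""
        i = slot + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def add(self, transition, priority=1.0):
        """Store a transition, overwriting the oldest slot once the buffer is full."""
        slot = self.next_slot
        self.data[slot] = transition
        self.update(slot, priority)
        self.next_slot = (slot + 1) % self.capacity
        return slot

    def total(self):
        """Sum of all priorities (the value stored at the root)."""
        return self.tree[1]

    def sample(self, rng=random):
        """Draw one slot with probability proportional to its priority."""
        mass = rng.random() * self.total()
        i = 1
        while i < self.capacity:  # descend until a leaf is reached
            left = 2 * i
            if mass < self.tree[left]:
                i = left
            else:
                mass -= self.tree[left]
                i = left + 1
        slot = i - self.capacity
        return slot, self.data[slot]
```

Both `update` and `sample` touch one root-to-leaf path, so each costs O(log N) for a buffer of N transitions. In this sketch a sample always sees the most recent priorities; the relaxation mentioned in the abstract concerns loosening exactly that ordering between the two operations.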

Efficient Parallel Methods for Deep Reinforcement Learning

Asynchronous Methods for Deep Reinforcement Learning

Efficient Parallel Reinforcement Learning Framework using the Reactor Model

Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU

Spreeze: High-Throughput Parallel Reinforcement Learning Framework

EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

Towards Understanding Asynchronous Advantage Actor-critic: Convergence and Linear Speedup

Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing

Parallel learner: A practical deep reinforcement learning framework for multi-scenario games

Actor-Critic Reinforcement Learning with Phased Actor

A Framework for Mapping DRL Algorithms with Prioritized Replay Buffer onto Heterogeneous Platforms

Fast Population-Based Reinforcement Learning on a Single Machine

Efficient Exploration in Deep Reinforcement Learning: A Novel Bayesian Actor-Critic Algorithm

Deep reinforcement learning algorithm based on multi-agent parallelism and its application in game environment

Online meta-learning by parallel algorithm competition

Efficient Deep Reinforcement Learning with Predictive Processing Proximal Policy Optimization

Parallelized Reverse Curriculum Generation

Multi-agent Gradient-Based Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Scaling Population-Based Reinforcement Learning with GPU Accelerated Simulation

Efficiently Training On-Policy Actor-Critic Networks in Robotic Deep Reinforcement Learning with Demonstration-like Sampled Exploration

Parallel bootstrap-based on-policy deep reinforcement learning for continuous flow control applications