Abstract: Deep Reinforcement Learning (Deep RL) is a key technology in several domains, such as self-driving cars, robotics, and surveillance. In Deep RL, an agent uses a Deep Neural Network (DNN) model to learn how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture depends on several factors, including (1) the suitability of the hardware architecture for the kernels and computation patterns that are fundamental to Deep RL; (2) the capability of the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware architecture to hide overheads introduced by the deeply nested, highly irregular computation characteristics of Deep RL algorithms. GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above-mentioned requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms, but these cannot be generalized easily to the plethora of Deep RL algorithms and DNN model choices that are available. In this paper, we explore the possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms, including variations in training methods or DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a widely used, broad class of Deep RL algorithms: on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms, PPO and A2C, in two application areas, robotics and games. We show that an FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1× (A2C) improvements on overall throughput.

How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous Platforms

A Framework for Mapping DRL Algorithms with Prioritized Replay Buffer onto Heterogeneous Platforms

Efficient Parallel Methods for Deep Reinforcement Learning

Towards Hardware Accelerated Reinforcement Learning for Application-Specific Robotic Control

Data Efficient Training for Reinforcement Learning with Adaptive Behavior Policy Sharing

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Scaling Population-Based Reinforcement Learning with GPU Accelerated Simulation

Efficient Deep Reinforcement Learning with Predictive Processing Proximal Policy Optimization

HDPG: hyperdimensional policy-based reinforcement learning for continuous control

Podracer architectures for scalable Reinforcement Learning

Deep Reinforcement Learning for Energy-Efficient on the Heterogeneous Computing Architecture

Integrating human learning and reinforcement learning: A novel approach to agent training

Towards scalable and efficient Deep-RL in edge computing: A game-based partition approach

Efficient Reinforcement Learning On Passive RRAM Crossbar Array

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance

GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Hardware as Policy: Mechanical and Computational Co-Optimization using Deep Reinforcement Learning

Device Placement for Autonomous Vehicles using Reinforcement Learning

Deep Reinforcement Learning: Framework, Applications, and Embedded Implementations