Abstract: Computing clusters that interconnect multiple compute nodes are used to accelerate distributed reinforcement learning based on DQN (Deep Q-Network). In distributed reinforcement learning, actor nodes acquire experiences by interacting with a given environment, and a learner node optimizes the DQN model. When distributed reinforcement learning is used in practical applications such as robotics, we can assume that the actor nodes are located on the edge side while the learner node is located on the cloud side. In this case, the long-haul communication between them imposes significant overhead. However, most prior work simply assumes that the actors and the learner are located close together and does not take this overhead into account. In this paper, we focus on the practical environment in which the actors and the learner are remote from each other and interact via a buffer node that collects information from multiple actor nodes. We implement a prototype system in which the buffer and learner nodes are connected via a 25GbE (25 Gigabit Ethernet) switch and a 10 km optical fiber cable. Although replay memory functionality is closely associated with the learner side, we propose integrating the replay memory into the buffer node. In experiments using the prototype system, the proposed approach is compared with an existing approach in terms of training efficiency (i.e., training loss) and transfer efficiency over the long-haul link (i.e., average priority of transferred experiences). The training loss of the proposed approach is reduced to 26% of that of the existing approach, and its average priority is 3.92 times higher than that of the existing approach after the training loss has converged. These results demonstrate that the proposed approach improves both training and communication efficiency over the existing approach in a practical system that imposes long-haul communication between the actors and the learner.
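The idea of hosting the replay memory on the buffer node can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `BufferNodeReplay`, the helper `average_priority`, and the use of proportional prioritized sampling are assumptions made for illustration, since the abstract does not specify how priorities are computed or how batches are drawn. The sketch only shows why sampling before the long-haul link tends to raise the average priority of what is transferred.

```python
import random
from collections import deque


class BufferNodeReplay:
    """Hypothetical replay memory hosted on the buffer node (edge side)."""

    def __init__(self, capacity, seed=None):
        # Oldest experiences are evicted first once capacity is reached.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, experience, priority):
        """Store one experience received from an actor, with a positive priority."""
        if priority <= 0:
            raise ValueError("priority must be positive")
        self.memory.append((priority, experience))

    def sample(self, batch_size):
        """Draw a batch with probability proportional to priority (with replacement)."""
        if not self.memory:
            return []
        items = list(self.memory)
        weights = [p for p, _ in items]
        return self.rng.choices(items, weights=weights, k=batch_size)


def average_priority(batch):
    """Mean priority of a batch of (priority, experience) pairs."""
    return sum(p for p, _ in batch) / len(batch)


if __name__ == "__main__":
    buf = BufferNodeReplay(capacity=1000, seed=0)
    for i in range(1000):
        # Assumed priority values; in practice these might be TD-error magnitudes.
        buf.add(experience=("state", "action", "reward", "next_state", i),
                priority=1.0 + (i % 10))
    batch = buf.sample(32)  # only this batch would cross the long-haul link
    print(average_priority(batch), average_priority(list(buf.memory)))
```

Under these assumptions, only the sampled batch is sent to the learner, so low-priority experiences are less likely to consume long-haul bandwidth.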

An Efficient Distributed Reinforcement Learning Architecture for Long-Haul Communication Between Actors and Learner
