Abstract: A computing cluster that interconnects multiple compute nodes is used to accelerate distributed reinforcement learning based on DQN (Deep Q-Network). In distributed reinforcement learning, actor nodes acquire experiences by interacting with a given environment, and a learner node optimizes the DQN model. When distributed reinforcement learning is used in practical applications such as robotics, we can assume that the actor nodes are located on the edge side while the learner node is located on the cloud side. In this case, the long-haul communication between them imposes significant communication overheads. However, most prior works simply assume that the actors and the learner are located close to each other and do not take these overheads into account. In this paper, we focus on the practical environment where the actors and the learner are located remotely from each other and interact via a buffer node that collects information from multiple actor nodes. We implement a prototype system in which the buffer and learner nodes are connected via a 25 Gigabit Ethernet (25GbE) switch and a 10 km optical fiber cable. Although the replay memory functionality is usually closely associated with the learner side, in this paper we propose to integrate the replay memory into the buffer node. In our experiments using the prototype system, the proposed approach is compared with an existing approach in terms of training efficiency (i.e., training loss) and transfer efficiency over the long-haul communication (i.e., average priority of transferred experiences). As a result, the training loss of the proposed approach is reduced to 26% of that of the existing approach, and the average priority is 3.92 times higher than that of the existing approach after the training loss has converged. These results demonstrate that the proposed approach can improve both training and communication efficiency compared with the existing approach in a practical system that imposes long-haul communication between the actors and the learner.
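The idea of hosting the replay memory on the buffer node can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how experiences are selected for transfer, so the sketch assumes sampling proportional to priority, as in prioritized experience replay. The names `BufferNodeReplay`, `sample_for_learner`, and `average_priority` are hypothetical, and the priority values are made up for illustration.

```python
import random
from collections import deque


class BufferNodeReplay:
    """Hypothetical buffer node that also hosts the replay memory."""

    def __init__(self, capacity):
        # Bounded memory: the oldest experience is evicted when full.
        self.memory = deque(maxlen=capacity)

    def add(self, experience, priority):
        # Store an experience from an actor with its (positive) priority.
        self.memory.append((priority, experience))

    def sample_for_learner(self, batch_size, rng):
        # ASSUMPTION: proportional prioritized sampling (with replacement).
        # Only this batch would cross the long-haul link to the learner.
        entries = list(self.memory)
        weights = [priority for priority, _ in entries]
        return rng.choices(entries, weights=weights, k=batch_size)


def average_priority(entries):
    """Mean priority of a collection of (priority, experience) pairs."""
    entries = list(entries)
    return sum(priority for priority, _ in entries) / len(entries)


# Illustrative data: many low-priority and a few high-priority experiences.
rng = random.Random(0)
node = BufferNodeReplay(capacity=1000)
for i in range(900):
    node.add({"id": i}, priority=0.1)
for i in range(900, 1000):
    node.add({"id": i}, priority=10.0)

batch = node.sample_for_learner(batch_size=256, rng=rng)
forward_everything = average_priority(node.memory)  # assumed baseline
prioritized = average_priority(batch)
print(forward_everything, prioritized)
```

Under these assumptions, the batch selected at the buffer node carries a higher average priority than forwarding every stored experience would, which is the kind of transfer-efficiency metric the abstract reports. The numbers printed here are properties of the toy data only and say nothing about the paper's measured results.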

An Efficient Distributed Reinforcement Learning Architecture for Long-Haul Communication Between Actors and Learner
