Abstract: Reinforcement learning (RL) has attracted much attention recently, as new and emerging AI-based applications demand the capability to react intelligently to environment changes. Unlike distributed deep neural network (DNN) training, distributed RL training has unique workload characteristics: it generates orders of magnitude more iterations with much smaller but more frequent gradient aggregations. More specifically, our study of typical RL algorithms shows that their distributed training is latency critical and that network communication for gradient aggregation occupies up to 83.2% of the execution time of each training iteration. In this paper, we present iSwitch, an in-switch acceleration solution that moves gradient aggregation from server nodes into the network switches, thereby reducing the number of network hops for gradient aggregation. This not only reduces the end-to-end network latency for synchronous training, but also improves convergence with faster weight updates for asynchronous training. On top of the in-switch accelerator, we further reduce the synchronization overhead by conducting on-the-fly gradient aggregation at the granularity of network packets rather than gradient vectors. Moreover, we rethink the distributed RL training algorithms and propose a hierarchical aggregation mechanism to further increase the parallelism and scalability of distributed RL training at rack scale. We implement iSwitch using a real-world programmable switch NetFPGA board. We extend the control and data plane of the programmable switch to support iSwitch without affecting its regular network functions. Compared with state-of-the-art distributed training approaches, iSwitch offers a system-level speedup of up to 3.66× for synchronous distributed training and 3.71× for asynchronous distributed training, while achieving better scalability.
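To illustrate the idea of aggregating at packet granularity rather than waiting for whole gradient vectors, the following is a minimal software sketch, not the iSwitch hardware design. The names `packetize`, `PacketAggregator`, and `aggregate_interleaved`, the segment length, and the round-robin arrival order are all assumptions made for illustration. Each worker's gradient is split into fixed-size segments, and the aggregator emits the sum for a segment as soon as every worker has contributed that segment.

```python
from collections import defaultdict


def packetize(gradient, seg_len):
    """Split a gradient vector into (segment_index, values) packets."""
    return [(i // seg_len, gradient[i:i + seg_len])
            for i in range(0, len(gradient), seg_len)]


class PacketAggregator:
    """Toy per-packet aggregator (hypothetical; not the iSwitch data plane)."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.sums = {}                  # segment index -> running sum
        self.counts = defaultdict(int)  # segment index -> packets seen

    def receive(self, seg_idx, values):
        """Accumulate one packet.

        Returns the summed segment once all workers have contributed
        that segment, otherwise None.
        """
        if seg_idx not in self.sums:
            self.sums[seg_idx] = list(values)
        else:
            acc = self.sums[seg_idx]
            for j, v in enumerate(values):
                acc[j] += v
        self.counts[seg_idx] += 1
        if self.counts[seg_idx] == self.num_workers:
            del self.counts[seg_idx]
            return self.sums.pop(seg_idx)
        return None


def aggregate_interleaved(gradients, seg_len):
    """Feed packets in an assumed round-robin order and collect the sums."""
    agg = PacketAggregator(len(gradients))
    streams = [packetize(g, seg_len) for g in gradients]
    out = {}
    for packets in zip(*streams):  # packet k from every worker, in turn
        for seg_idx, values in packets:
            done = agg.receive(seg_idx, values)
            if done is not None:
                out[seg_idx] = done  # segment ready before later ones arrive
    return [x for k in sorted(out) for x in out[k]]


if __name__ == "__main__":
    print(aggregate_interleaved([[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]], 2))
```

In this sketch, the first segment's sum is available after only the first packet from each worker has arrived, which is the property that lets aggregation overlap with transmission instead of waiting for complete vectors.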

SpeedyZero: Mastering Atari with Limited Data and Time

EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Mastering Atari Games with Limited Data

Learning with Training Wheels: Speeding up Training with a Simple Controller for Deep Reinforcement Learning

Sample-efficient multi-agent reinforcement learning with masked reconstruction

Reverse Forward Curriculum Learning for Extreme Sample and Demonstration Efficiency in Reinforcement Learning

SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference

Become a Proficient Player with Limited Data through Watching Pure Videos

Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning

SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores

Dynamic Sparse Training for Deep Reinforcement Learning

Spreeze: High-Throughput Parallel Reinforcement Learning Framework

RLx2: Training a Sparse Deep Reinforcement Learning Model from Scratch

Episodic Reinforcement Learning with Expanded State-reward Space

Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

Application of Deep-RL with Sample-Efficient Method in Mini-games of StarCraft II

Ddper - Decentralized Distributed Prioritized Experience Replay.

SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning

Snapshot Reinforcement Learning: Leveraging Prior Trajectories for Efficiency

Accelerating distributed reinforcement learning with in-switch computing