Abstract: Reinforcement learning (RL) has attracted much attention recently, as new and emerging AI-based applications demand the ability to react intelligently to environment changes. Unlike distributed deep neural network (DNN) training, distributed RL training has unique workload characteristics: it generates orders of magnitude more iterations, with much smaller but more frequent gradient aggregations. More specifically, our study of typical RL algorithms shows that their distributed training is latency-critical and that network communication for gradient aggregation occupies up to 83.2% of the execution time of each training iteration. In this paper, we present iSwitch, an in-switch acceleration solution that moves gradient aggregation from server nodes into the network switches, thereby reducing the number of network hops needed for gradient aggregation. This not only reduces the end-to-end network latency for synchronous training but also improves convergence through faster weight updates for asynchronous training. On top of the in-switch accelerator, we further reduce the synchronization overhead by conducting on-the-fly gradient aggregation at the granularity of network packets rather than entire gradient vectors. Moreover, we rethink the distributed RL training algorithms and propose a hierarchical aggregation mechanism to further increase the parallelism and scalability of distributed RL training at rack scale. We implement iSwitch on a real-world programmable switch, the NetFPGA board, and extend the switch's control and data planes to support iSwitch without affecting its regular network functions. Compared with state-of-the-art distributed training approaches, iSwitch offers system-level speedups of up to 3.66× for synchronous distributed training and 3.71× for asynchronous distributed training, while achieving better scalability.
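The packet-granularity and hierarchical aggregation ideas above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not iSwitch's NetFPGA data-plane design: it assumes each gradient vector is split into fixed-size chunks, that each packet carries a sequence index identifying its chunk, and that every switch knows how many inputs (workers or child switches) to expect. The names `GradientPacket`, `PacketAggregator`, `packetize`, and `hierarchical_aggregate` are hypothetical. A switch emits an aggregated packet as soon as all inputs for that chunk have arrived, instead of waiting for whole gradient vectors; rack-level (top-of-rack) switches forward their partial sums to a root switch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class GradientPacket:
    worker_id: int        # sender: a worker, or a rack switch forwarding a partial sum
    seq: int              # index of this packet's chunk within the gradient vector
    payload: List[float]  # the gradient values carried by this packet


class PacketAggregator:
    """Hypothetical model of on-the-fly, per-packet gradient summation in a switch."""

    def __init__(self, num_inputs: int, node_id: int):
        self.num_inputs = num_inputs  # how many senders must contribute to each chunk
        self.node_id = node_id        # used as worker_id on packets forwarded upstream
        self.sums: Dict[int, List[float]] = {}
        self.seen: Dict[int, Set[int]] = {}

    def on_packet(self, pkt: GradientPacket) -> Optional[GradientPacket]:
        if pkt.seq not in self.sums:
            self.sums[pkt.seq] = [0.0] * len(pkt.payload)
            self.seen[pkt.seq] = set()
        if pkt.worker_id in self.seen[pkt.seq]:
            return None  # assumption: duplicates (e.g. retransmissions) are dropped
        self.seen[pkt.seq].add(pkt.worker_id)
        acc = self.sums[pkt.seq]
        for i, v in enumerate(pkt.payload):
            acc[i] += v
        # Emit this chunk as soon as every input has contributed to it.
        if len(self.seen[pkt.seq]) == self.num_inputs:
            del self.seen[pkt.seq]
            return GradientPacket(self.node_id, pkt.seq, self.sums.pop(pkt.seq))
        return None


def packetize(worker_id: int, grad: List[float], chunk: int) -> List[GradientPacket]:
    """Split one worker's gradient vector into fixed-size packets."""
    return [GradientPacket(worker_id, s // chunk, grad[s:s + chunk])
            for s in range(0, len(grad), chunk)]


def hierarchical_aggregate(grads: Dict[int, List[float]],
                           racks: Dict[int, List[int]], chunk: int) -> List[float]:
    """Sum gradients via per-rack switches feeding one root switch (two levels)."""
    root = PacketAggregator(num_inputs=len(racks), node_id=-1)
    done: Dict[int, List[float]] = {}
    for rack_id, workers in racks.items():
        tor = PacketAggregator(num_inputs=len(workers), node_id=rack_id)
        for w in workers:
            for pkt in packetize(w, grads[w], chunk):
                partial = tor.on_packet(pkt)
                if partial is not None:
                    full = root.on_packet(partial)
                    if full is not None:
                        done[full.seq] = full.payload
    # Reassemble the aggregated chunks in sequence order.
    result: List[float] = []
    for seq in sorted(done):
        result.extend(done[seq])
    return result
```

For example, with four workers split across two racks and a chunk size of 2, `hierarchical_aggregate({0: [1.0, 2.0, 3.0, 4.0], 1: [10.0, 20.0, 30.0, 40.0], 2: [100.0, 200.0, 300.0, 400.0], 3: [1000.0, 2000.0, 3000.0, 4000.0]}, {0: [0, 1], 1: [2, 3]}, chunk=2)` returns the element-wise sum `[1111.0, 2222.0, 3333.0, 4444.0]`. The model only sums; whether averaging happens in the switch or at the workers is left open here.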

Learning Buffer Management Policies for Shared Memory Switches

TRCC: Transferable Congestion Control with Reinforcement Learning

Traffic-aware Buffer Management in Shared Memory Switches

ABS: Adaptive Buffer Sizing Via Augmented Programmability with Machine Learning

L2BM: Switch Buffer Management for Hybrid Traffic in Data Center Networks

Applying Buffer to SDN Switches: Benefits Analysis and Mechanism Design

D2T: Dynamic Dual Threshold Policy of Shared-Memory in Data Center Switches

Analyzing and Enhancing Dynamic Threshold Policy of Data Center Switches

Absorbing Micro-Burst Traffic by Enhancing Dynamic Threshold Policy of Data Center Switches

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

Re-Architecting Buffer Management in Lossless Ethernet

Cutting Long-Tail Latency of Routing Response in Software Defined Networks

DECC: Achieving Low Latency in Data Center Networks with Deep Reinforcement Learning

Balancer: A Traffic-Aware Hybrid Rule Allocation Scheme in Software Defined Networks

Buffer-Assisted Network Updates in Timed SDN

Accelerating Distributed Reinforcement Learning with In-Switch Computing

A Novel Buffer Management For Input Queued Switches Providing Diffserv

FB: A Flexible Buffer Management Scheme for Data Center Switches

TCP Congestion Management Using Deep Reinforcement Trained Agent for RED

Performance of Various Input-Buffered and Output-Buffered ATM Switch Design Principles under Bursty Traffic: Simulation Study

Policy Reuse for Communication Load Balancing in Unseen Traffic Scenarios