Abstract: Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks that replace the transport layer. Such INA-oriented network stacks cannot take advantage of state-of-the-art, high-performance transport-layer implementations, and they add complexity to system development and operation. We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs aggregation operations but preserves data transmission connections, and the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce thus achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and freed-up bandwidth from INA; high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce based on NetReduce to use intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable-switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce's performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce interoperates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overhead (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption yield 12.3-57.9% more performance acceleration).
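As a rough illustration of the aggregation idea summarized in the abstract, the following minimal Python sketch models a switch that sums one gradient chunk per worker and releases the result once every worker has contributed. It is not the NetReduce design: the class `ToyAggregator`, the method `on_packet`, and the `(worker_id, seq, payload)` message shape are hypothetical names chosen for this example, and the sketch ignores RoCE packet formats, loss recovery, and the FPGA data path entirely.

```python
from collections import defaultdict


class ToyAggregator:
    """Hypothetical switch-side model: one chunk per worker per sequence number."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.partial = {}               # seq -> running element-wise sum
        self.seen = defaultdict(set)    # seq -> worker ids already counted

    def on_packet(self, worker_id, seq, payload):
        """Accumulate one chunk; return the full sum once all workers contributed."""
        if worker_id in self.seen[seq]:
            # Simplifying assumption: a repeated chunk is treated as a duplicate.
            return None
        self.seen[seq].add(worker_id)
        if seq not in self.partial:
            self.partial[seq] = list(payload)
        else:
            acc = self.partial[seq]
            for i, value in enumerate(payload):
                acc[i] += value
        if len(self.seen[seq]) == self.num_workers:
            # All workers reported: release the result and free the slot.
            del self.seen[seq]
            return self.partial.pop(seq)
        return None


if __name__ == "__main__":
    agg = ToyAggregator(num_workers=3)
    agg.on_packet(0, seq=0, payload=[1, 2])
    agg.on_packet(1, seq=0, payload=[10, 20])
    print(agg.on_packet(2, seq=0, payload=[100, 200]))
```

In this toy model each host would receive the same aggregated chunk back, so the traffic leaving the switch toward each host stays constant as workers are added, which is the intuition behind the scalability and traffic-reduction claims.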

XAgg: Accelerating Heterogeneous Distributed Training Through XDP-Based Gradient Aggregation

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters

Adaptive Consensus Gradients Aggregation for Scaled Distributed Training

Identifying Performance Bottleneck in Shared In-Network Aggregation During Distributed Training

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters

EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

Peering Beyond the Gradient Veil with Distributed Auto Differentiation

AggTree: A Routing Tree with In-Network Aggregation for Distributed Training

ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters

Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems

Joint Dynamic Grouping and Gradient Coding for Time-Critical Distributed Machine Learning in Heterogeneous Edge Networks

ATP: In-network Aggregation for Multi-tenant Learning

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training

Accelerated Distributed Aggregative Optimization

In-Network Aggregation with Transport Transparency for Distributed Training

An Efficient Bandwidth-Adaptive Gradient Compression Algorithm for Distributed Training of Deep Neural Networks

Sparse Gradient Compression for Distributed SGD