Abstract: Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks to replace the transport layer. Such an INA-oriented network stack cannot take advantage of state-of-the-art, high-performance transport layer implementations, and it also complicates system development and operation. We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs aggregation operations but preserves data transmission connections, and the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and freed-up bandwidth from INA; high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce based on NetReduce to use intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce’s performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce inter-operates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overheads (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption, yielding 12.3-57.9% more performance acceleration).

Optimizing Deep Learning Frameworks Incrementally to Get Linear Speedup: A Comparison Between IPoIB and RDMA Verbs

Improving the Performance of Distributed MXNet with RDMA

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Optimization of RDMA-Based HDFS Data Distribution Mechanism

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Maximizing the Benefit of RDMA at End Hosts

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

RM-KVStore: New MXNet KVStore to Accelerate Transfer Performance with RDMA

Optimizing execution for pipelined-based distributed deep learning in a heterogeneously networked GPU cluster

Improving the Performance of Distributed TensorFlow with RDMA

On Optimizing the Communication of Model Parallelism

Accelerating the Shuffle Phase to Speed Up MapReduce Systems

Towards Zero Copy Dataflows using RDMA

A Memory-efficient Hybrid Parallel Framework for Deep Neural Network Training

An Optimized RDMA QP Communication Mechanism for Hyperscale AI Infrastructure

In-Network Aggregation with Transport Transparency for Distributed Training

Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning

Host-driven In-Network Aggregation on RDMA

High-Speed Data Communication with Advanced Networks in Large Language Model Training

Accelerating Spark Shuffle with RDMA