Abstract: Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks that replace the transport layer. Such INA-oriented network stacks cannot take advantage of state-of-the-art, high-performance transport layer implementations, and they add complexity to system development and operation. We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs aggregation operations but preserves data transmission connections, and the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and freed-up bandwidth from INA, together with high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce based on NetReduce that uses intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce's performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce interoperates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overhead (e.g., only one core for transmission), and is cost-effective (e.g., 2.40% more capital expense and 0.68% more power consumption yield 12.3-57.9% more performance acceleration).
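The aggregation semantics described in the abstract can be illustrated with a minimal sketch. It only models the arithmetic of an in-network all-reduce (the switch sums the gradient chunks it receives and returns the same result to every worker); it does not model RoCE, packet handling, or the FPGA design. The function names `switch_aggregate` and `in_network_all_reduce` are hypothetical and are not taken from NetReduce.

```python
from typing import List


def switch_aggregate(chunks: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient chunks, one per worker.

    Hypothetical stand-in for the aggregation a switch would perform.
    """
    if not chunks:
        raise ValueError("need at least one worker chunk")
    length = len(chunks[0])
    if any(len(chunk) != length for chunk in chunks):
        raise ValueError("all chunks must have the same length")
    # zip(*chunks) groups the i-th element of every worker's chunk.
    return [sum(values) for values in zip(*chunks)]


def in_network_all_reduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Return the aggregated gradient that each worker would receive.

    Each worker sends its chunk once and receives one aggregated chunk,
    so per-host traffic does not grow with the number of workers.
    """
    aggregated = switch_aggregate(worker_grads)
    # Every worker gets its own copy of the same aggregation result.
    return [list(aggregated) for _ in worker_grads]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for rank, result in enumerate(in_network_all_reduce(grads)):
        print(f"worker {rank} receives {result}")
```

In this toy model each worker contributes one chunk and receives one chunk regardless of how many workers participate, which is the intuition behind the traffic reduction attributed to INA above.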
