Abstract: Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks that replace the transport layer. Such an INA-oriented network stack cannot take advantage of state-of-the-art, high-performance transport layer implementations, and it also adds complexity to system development and operation. We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer: the switch performs aggregation operations but preserves data transmission connections, and the host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and freed-up bandwidth from INA, and high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise a parallel all-reduce based on NetReduce that uses intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce’s performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce interoperates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overhead (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption for 12.3-57.9% more performance acceleration).
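
To make the all-reduce semantics concrete, the following is a minimal Python sketch of the general in-network aggregation idea: workers send gradient chunks tagged with a sequence number, an aggregator sums the chunks element-wise, and every worker receives the same result. It is a toy model under stated assumptions, not NetReduce's design: the names `ToyAggregator`, `receive`, and `all_reduce`, the chunk size, and the duplicate handling are all hypothetical, and RoCE transport, connection preservation, and the FPGA datapath are not modeled.

```python
from typing import Dict, List, Optional, Set


class ToyAggregator:
    """Hypothetical stand-in for switch-side aggregation (not the paper's design)."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.partial: Dict[int, List[float]] = {}  # seq -> running element-wise sum
        self.seen: Dict[int, Set[int]] = {}        # seq -> workers that contributed

    def receive(self, worker_id: int, seq: int,
                payload: List[float]) -> Optional[List[float]]:
        """Add one worker's chunk; return the sum once every worker has contributed."""
        seen = self.seen.setdefault(seq, set())
        if worker_id in seen:
            # Assumption: a repeated chunk that arrives before the round
            # completes is simply ignored in this toy model.
            return None
        seen.add(worker_id)
        acc = self.partial.setdefault(seq, [0.0] * len(payload))
        for i, value in enumerate(payload):
            acc[i] += value
        if len(seen) == self.num_workers:
            del self.seen[seq]
            return self.partial.pop(seq)
        return None


def all_reduce(gradients: List[List[float]], chunk: int = 2) -> List[List[float]]:
    """Sum equal-length gradient vectors chunk by chunk and give each worker the result."""
    aggregator = ToyAggregator(len(gradients))
    length = len(gradients[0])
    result = [0.0] * length
    for seq, start in enumerate(range(0, length, chunk)):
        for worker_id, grad in enumerate(gradients):
            summed = aggregator.receive(worker_id, seq, grad[start:start + chunk])
            if summed is not None:
                # The last contributor completes the round for this chunk.
                result[start:start + chunk] = summed
    # Every worker receives an identical copy of the aggregated gradient.
    return [list(result) for _ in gradients]
```

In this sketch each worker sends every chunk once and receives one aggregated chunk back, so the per-worker traffic does not depend on the number of workers, which is the intuition behind the scalability and traffic-reduction claims for INA.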
