Host-driven In-Network Aggregation on RDMA

Yulong Li, Wenxin Li, Yinan Yao, Yuxuan Du, Keqiu Li
DOI: https://doi.org/10.1109/infocom52122.2024.10621230
2024-01-01
Abstract: Large-scale datacenter networks increasingly use in-network aggregation (INA) and remote direct memory access (RDMA) to accelerate deep neural network (DNN) training. However, existing research trends suggest that these two techniques are on an inevitable collision course. To resolve this conflict, we present FreeINA, a host-driven in-network aggregation design aimed at providing RDMA reliable connection (RC) support for multi-tenant learning settings. FreeINA relies on dual transmission paths to support RC compatibility: one path for INA and another for aggregation on an end-host parameter server. By dynamically controlling these two paths, FreeINA leaves traditional in-server aggregation unaffected while ensuring INA's reliability without modifying RDMA network interface cards (RNICs). To support multi-tenant learning, FreeINA employs all-reduce-level memory allocation, which captures the well-known "on and off" DNN training pattern and thus improves switch memory efficiency. We have implemented a FreeINA prototype using a P4-programmable switch and commercial RNICs, and evaluated it extensively on a 100 Gbps testbed. The results show that, compared to the state-of-the-art solution ATP, FreeINA improves the single-job training speedup ratio by 1.20x and the aggregation throughput by 2.65x in the multi-job scenario.
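The dual-path idea and the all-reduce-level memory allocation described in the abstract can be illustrated with a toy model. The sketch below is not FreeINA's protocol or data plane; it only assumes that a job requests switch aggregator slots at the start of an all-reduce, aggregates "in the switch" if the request succeeds, falls back to a parameter server otherwise, and returns the slots when the all-reduce ends. All names (`SwitchMemory`, `all_reduce`, `slots_needed`) are hypothetical.

```python
from typing import List, Tuple


class SwitchMemory:
    """Toy pool of aggregator slots shared by all jobs (hypothetical model)."""

    def __init__(self, total_slots: int) -> None:
        self.free = total_slots

    def try_acquire(self, slots: int) -> bool:
        # Grant the request only if enough slots are currently free.
        if slots <= self.free:
            self.free -= slots
            return True
        return False

    def release(self, slots: int) -> None:
        self.free += slots


def elementwise_sum(gradients: List[List[float]]) -> List[float]:
    # Sum the i-th element of every worker's gradient vector.
    return [sum(column) for column in zip(*gradients)]


def all_reduce(
    memory: SwitchMemory, gradients: List[List[float]], slots_needed: int
) -> Tuple[str, List[float]]:
    """Aggregate one all-reduce, choosing a path by slot availability.

    Assumption: slots are held only for the duration of one all-reduce,
    so a job in its "off" phase occupies no switch memory.
    """
    if memory.try_acquire(slots_needed):
        try:
            return "switch", elementwise_sum(gradients)  # INA path
        finally:
            memory.release(slots_needed)
    return "server", elementwise_sum(gradients)  # parameter-server path


if __name__ == "__main__":
    memory = SwitchMemory(total_slots=4)
    worker_gradients = [[1.0, 2.0], [3.0, 4.0]]
    print(all_reduce(memory, worker_gradients, slots_needed=2))
    memory.try_acquire(3)  # another job enters its "on" phase
    print(all_reduce(memory, worker_gradients, slots_needed=2))
```

Both paths return the same sum; only the location of the aggregation differs, which is why falling back to the server path is safe in this model.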