N4: Network for N Neural Network Training

Jiasheng Zhou, Shengrui Lin, Hongyan Liu, Xinyang Chen, Pengpai Shi, Longlong Zhu, Dong Zhang
DOI: https://doi.org/10.1109/icc51166.2024.10622794
2024-01-01
Abstract: As data volumes and the complexity of neural network models continue to grow, distributed training has become increasingly crucial for improving training speed. However, the bottleneck of distributed training is the communication overhead among distributed workers. Recent research has shown that in-network aggregation on programmable switches is an effective way to accelerate distributed training. Previous work, however, targets only specific neural network models and applies only to specified network topologies, whereas administrators may train different models across different network topologies. To generalize the use of programmable switches for accelerating distributed training, we propose N4, a programmable intra-switch acceleration framework that supports distributed training of multiple neural networks. N4 also supports deploying distributed workers on any topology. Our experimental results show that N4 ensures high performance and isolation when training numerous neural networks. N4 outperforms state-of-the-art systems, accelerating training for existing methods by up to 3.4×.