Augmenting Distributed AI Training with Loss-tolerant Transmission

Zixuan Chen, Lei Shi, Yongbo Gao, Xuandong Liu, Xin Ai, Sen Liu, Yang Xu
DOI: https://doi.org/10.1145/3603165.3607399
2023-01-01
Abstract: The parameter server (PS) communication architecture is used in distributed machine learning (DML) systems to speed up model training in data centers (DCs) and on edge nodes. However, it suffers from severe long-tail latency caused by many-to-one "incast" traffic patterns and from non-congestion packet loss, both of which reduce training throughput. To address this challenge, we design the Loss-tolerant Transmission Protocol (LTP), which permits partial loss of gradients during synchronization, avoiding unnecessary retransmissions and shortening the synchronization time of each iteration. A preliminary evaluation shows that LTP outperforms other schemes in both communication latency and training accuracy.