RDMA Based Congestion Control Strategy for DML Training Optimization in Data Center Networks

Jiagui Wu, Yang Qin, Weihong Yang, Ruonan Li
DOI: https://doi.org/10.21203/rs.3.rs-1868946/v1
2022-01-01
Abstract: Distributed Machine Learning (DML) is an important means of accelerating the training of machine learning models. However, the Remote Direct Memory Access (RDMA) technology used in data center networks does not support DML communication well during parameter synchronization, which leads to relatively large communication overhead. This paper proposes TMDML (Traffic Management for DML), in which network nodes record and maintain the state of each data flow and allocate bandwidth according to the network state. By alleviating the slow-flow lag problem, TMDML achieves higher communication efficiency and reduces the total time needed to complete DML tasks. On this basis, we design TMDML-NIC and TMDML-Switch to meet the deployment requirements of different network equipment. We also propose a fluid model of TMDML, and experiments show that the fluid model agrees well with the implementation. The fluid model makes it possible to visualize how important indicators such as flow rate change over time, to predict these indicators to a certain extent, and to better tune the protocol parameters. We conduct simulation experiments comparing DCQCN and TMDML on different deep learning models and different topologies. The results show that, compared with DCQCN, TMDML reduces the number of ECN marks per switch, which indicates that TMDML better alleviates congestion in the network. The communication overhead caused by parameter synchronization is reduced in all of our training tasks. In the Fat-Tree topology in particular, the reduction is at least 32%. This reduction will be higher for more time-consuming models and for topologies with more nodes.
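The slow-flow lag problem mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not TMDML's algorithm, which the abstract does not specify. It only assumes that a synchronization round ends when the slowest flow finishes, and compares two static bandwidth allocations on a single bottleneck link. The flow sizes, link capacity, and function names are hypothetical.

```python
def round_completion_time(flow_sizes, rates):
    """A synchronization round ends only when the slowest flow finishes."""
    return max(size / rate for size, rate in zip(flow_sizes, rates))


def equal_share(flow_sizes, capacity):
    """Static equal split of the link, ignoring how much each flow must send."""
    n = len(flow_sizes)
    return [capacity / n] * n


def proportional_share(flow_sizes, capacity):
    """Split the link in proportion to flow size so all flows finish together."""
    total = sum(flow_sizes)
    return [capacity * size / total for size in flow_sizes]


# Hypothetical values: three parameter-synchronization flows (in MB)
# sharing one bottleneck link (in MB/s).
flow_sizes = [100.0, 100.0, 400.0]
capacity = 100.0

t_equal = round_completion_time(flow_sizes, equal_share(flow_sizes, capacity))
t_prop = round_completion_time(flow_sizes, proportional_share(flow_sizes, capacity))

print(f"equal share:        {t_equal:.2f} s")
print(f"proportional share: {t_prop:.2f} s")
```

In this simplified model the large flow lags under the equal split and holds back the whole round, while a state-aware allocation lets every flow reach the barrier at the same time. Real congestion control adjusts rates dynamically, so the numbers here only show the direction of the effect.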