Impact of Network Topology on the Performance of DML: Theoretical Analysis and Practical Factors
Shuai Wang, Dan Li, Jinkun Geng, Yue Gu, Yang Cheng
DOI: https://doi.org/10.1109/infocom.2019.8737595
2019-01-01
Abstract: To cope with ever-larger input data and model sizes, the training of machine learning models must be scaled out to multiple nodes, or even a server cluster; we call this distributed machine learning (DML). However, DML gains computation power at the cost of high communication overhead, which may in turn limit overall performance. In this paper, we study the impact of network topology on DML performance both in theory and in practice. We compare two representative network topologies, both running on top of RDMA: Fat-Tree, which is widely used in modern data centers, and BCube, a low-cost, server-centric topology. The results show that Fat-Tree not only has a theoretically higher global synchronization time (GST) than BCube, but its practical GST (measured by NS-3-based simulation) is also considerably larger than its theoretical value. By analyzing the large-scale simulation traces, we find that the root cause of this gap in Fat-Tree is load imbalance among the multiple parallel paths together with the inevitable PFC frames, neither of which appears in BCube. For a cluster of around 250 servers, BCube achieves 53% to 70% lower GST than Fat-Tree in simulation. We therefore suggest using a server-centric network topology such as BCube, instead of the common Fat-Tree network, to build a special-purpose DML cluster, owing to its parallel synchronization, RDMA friendliness, natural load balance, and low economic cost.
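
To give a sense of the scale mentioned in the abstract ("around 250 servers"), the sketch below computes server and switch counts from the standard textbook definitions of a k-ary Fat-Tree and BCube(n, k). The function names and the specific parameter choices (k = 10, BCube(16, 1), BCube(4, 3)) are illustrative assumptions; the abstract does not state which configurations the authors simulated.

```python
# Illustrative sizing only: the parameters below are assumptions,
# not the configurations used in the paper.

def fat_tree_size(k: int) -> dict:
    """Standard k-ary Fat-Tree built from k-port switches (k must be even)."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even number")
    return {
        "servers": k**3 // 4,          # k pods * (k/2 edge switches) * (k/2 hosts)
        "switches": 5 * k**2 // 4,     # k^2 pod switches + (k/2)^2 core switches
        "nic_ports_per_server": 1,
    }

def bcube_size(n: int, k: int) -> dict:
    """BCube(n, k): k+1 levels of n-port switches; servers relay traffic."""
    if n <= 1 or k < 0:
        raise ValueError("need n > 1 and k >= 0")
    return {
        "servers": n ** (k + 1),
        "switches": (k + 1) * n**k,    # n^k switches at each of the k+1 levels
        "nic_ports_per_server": k + 1, # one port per level
    }

if __name__ == "__main__":
    print("Fat-Tree k=10:", fat_tree_size(10))
    print("BCube(16, 1): ", bcube_size(16, 1))
    print("BCube(4, 3):  ", bcube_size(4, 3))
```

Under these assumed parameters, both topologies land near 250 servers. In BCube each server has several ports, one per level, which is the structural basis for the parallel synchronization the abstract refers to.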