Revisiting the Time Cost Model of AllReduce

Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang
2024-09-06
Abstract: AllReduce is an important and popular collective communication primitive, which has been widely used in areas such as distributed machine learning and high performance computing. To design, analyze, and choose from various algorithms and implementations of AllReduce, the time cost model plays a crucial role, and the predominant one is the $(\alpha,\beta,\gamma)$ model. In this paper, we revisit this model and reveal that it cannot characterize the time cost of AllReduce well on modern clusters, and thus must be updated. We perform extensive measurements to identify two additional terms contributing to the time cost: the incast term and the memory access term. We augment the $(\alpha,\beta,\gamma)$ model with these two terms, and present GenModel as a result. Using GenModel, we discover two new optimalities for AllReduce algorithms, and prove that they cannot be achieved simultaneously. Finally, striking a balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree can achieve 1.22$\times$ to 1.65$\times$ speed-up against NCCL. Large-scale simulations also confirm that GenTree can improve the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the inaccuracy of the existing time cost model for AllReduce operations on modern clusters. Specifically:

1. **Deficiencies of Existing Time Cost Models**: The predominant (α, β, γ) model does not accurately characterize the time cost of AllReduce on modern clusters and therefore needs to be updated.
2. **Introducing New Factors**: Through extensive measurements, the paper identifies two additional contributors to the time cost: **incast** and **memory access cost**.
   - **Incast**: When multiple senders transmit to the same receiver at once (many-to-one communication), they compete for bandwidth, so the bandwidth actually available is lower than the model assumes. This introduces additional overhead.
   - **Memory Access Cost**: As the network bandwidth available to a host increases, the gap between network bandwidth and memory bandwidth narrows, so memory access cost is no longer negligible.
3. **Proposing a New Model**: Based on these factors, the paper proposes an enhanced time cost model, GenModel, which keeps the three terms of the (α, β, γ) model and adds an incast term (ε) and a memory access term (δ). GenModel predicts the time cost of AllReduce under given network conditions more accurately.
4. **Algorithm Optimization**: Using GenModel, the paper identifies two new optimalities for AllReduce algorithms, one that minimizes incast overhead and one that minimizes memory access cost, and proves that they cannot be achieved simultaneously, so they must be balanced. Based on this, the paper designs GenTree, a heuristic AllReduce plan generation algorithm for tree-like topologies that balances the two goals.
5. **Experimental Validation**: The paper validates GenModel and GenTree on a real testbed and in large-scale simulations. On a testbed with 64 GPUs, GenTree achieves a 1.22× to 1.65× speed-up over NCCL. In simulations, it improves on the state-of-the-art AllReduce algorithm by a factor of 1.2 to 7.4 in scenarios where the two new terms dominate.

In summary, this paper aims to improve the performance of AllReduce on modern clusters by enhancing the time cost model and designing more efficient algorithms.
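To make the discussion of cost terms concrete, the sketch below evaluates the classic (α, β, γ) estimate for a ring AllReduce and counts memory operations for a reduction with fan-in k. It is an illustration under stated assumptions, not the paper's GenModel: the function names `ring_allreduce_time` and `reduce_memory_ops` are hypothetical, the ring formula is the standard textbook one (reduce-scatter followed by all-gather), and the memory-operation count simply assumes one read per input buffer and one write per output buffer. The paper's actual incast (ε) and memory access (δ) terms are not reproduced here.

```python
def ring_allreduce_time(p: int, n: float, alpha: float, beta: float, gamma: float) -> float:
    """Classic (alpha, beta, gamma) estimate for ring AllReduce.

    p:     number of participants
    n:     message size in bytes
    alpha: per-message latency (seconds)
    beta:  transfer time per byte (seconds/byte)
    gamma: reduction compute time per byte (seconds/byte)
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if p == 1:
        return 0.0  # nothing to communicate
    steps = p - 1        # steps in each of the two phases
    chunk = n / p        # bytes sent per step
    # Reduce-scatter: every step sends one chunk and reduces it.
    reduce_scatter = steps * (alpha + chunk * (beta + gamma))
    # All-gather: every step sends one chunk, no reduction.
    all_gather = steps * (alpha + chunk * beta)
    return reduce_scatter + all_gather


def reduce_memory_ops(k: int, fused: bool) -> int:
    """Memory operations per element needed to combine k buffers into one.

    Assumption (illustrative only): each input costs one read and each
    output costs one write.
    fused=True  -> all k buffers are reduced in a single pass: k reads + 1 write.
    fused=False -> k - 1 pairwise reductions: each does 2 reads + 1 write.
    """
    if k < 2:
        return 0
    return k + 1 if fused else 3 * (k - 1)


if __name__ == "__main__":
    # Hypothetical parameters chosen only to exercise the functions.
    t = ring_allreduce_time(p=4, n=400, alpha=1.0, beta=0.01, gamma=0.001)
    print(f"ring AllReduce estimate: {t:.3f} s")
    for k in (2, 4, 8):
        print(k, reduce_memory_ops(k, fused=True), reduce_memory_ops(k, fused=False))
```

One way to read the tension described above: reducing k buffers in a single pass needs fewer memory operations than a chain of pairwise reductions, but it requires k - 1 peers to send to the same receiver at once, which is the many-to-one pattern where incast overhead arises.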