Abstract: AllReduce is an important and popular collective communication primitive that is widely used in areas such as distributed machine learning and high-performance computing. The time cost model plays a crucial role in designing, analyzing, and choosing among the various algorithms and implementations of AllReduce, and the predominant one is the $(\alpha,\beta,\gamma)$ model. In this paper, we revisit this model and show that it does not characterize the time cost of AllReduce on modern clusters well, and thus must be updated. We perform extensive measurements to identify two additional terms that contribute to the time cost: the incast term and the memory access term. We augment the $(\alpha,\beta,\gamma)$ model with these two terms and present GenModel as the result. Using GenModel, we discover two new optimalities for AllReduce algorithms and prove that they cannot be achieved simultaneously. Finally, striking a balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree achieves a 1.22$\times$ to 1.65$\times$ speed-up over NCCL. Large-scale simulations also confirm that GenTree improves on the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the inaccuracy of time cost models for AllReduce operations on modern clusters. Specifically:

1. **Deficiencies of Existing Time Cost Models**: The widely used (α, β, γ) model fails to describe the time cost of AllReduce operations on modern clusters accurately, so it needs to be updated.
2. **Introducing New Factors**: The paper identifies two additional factors, **incast** and **memory access cost**, which become increasingly important in large-scale modern clusters.
   - **Incast**: When multiple machines send to the same receiver at once (many-to-one communication), they compete for bandwidth, so the bandwidth actually available is lower than the model assumes. This introduces additional overhead.
   - **Memory Access Cost**: As host network bandwidth increases, the gap between network bandwidth and memory bandwidth narrows, making the memory access cost non-negligible.
3. **Proposing a New Model**: Based on these factors, the paper proposes an enhanced time cost model, GenModel, which keeps the three components of the (α, β, γ) model and adds an incast term (ε) and a memory access term (δ). GenModel predicts the time cost of AllReduce operations under given network conditions more accurately.
4. **Algorithm Optimization**: The paper shows that incast optimality (minimizing incast overhead) and memory access optimality (minimizing memory access cost) cannot be achieved simultaneously and must be balanced. Based on this, it designs GenTree, a heuristic algorithm for tree-like topologies that balances the two goals.
5. **Experimental Validation**: The paper validates GenModel and GenTree through tests on a real testbed and through large-scale simulations. GenTree achieves a 1.22× to 1.65× speed-up over NCCL on a 64-GPU testbed, and improves on the state-of-the-art AllReduce algorithm by a factor of 1.2 to 7.4 in simulated scenarios where the two new terms dominate.

In summary, this paper aims to improve the performance of AllReduce on modern clusters by enhancing the time cost model and designing a more efficient plan generation algorithm.
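To make the baseline concrete, here is a minimal sketch of how the classic (α, β, γ) model is commonly applied to a ring AllReduce. It is not taken from the paper: the function name and the choice of the ring algorithm are assumptions made for illustration. GenModel adds the δ and ε terms on top of a cost like this one; their exact form is not given in this summary, so they are left out.

```python
def ring_allreduce_cost(n, size, alpha, beta, gamma):
    """Estimate ring AllReduce time under the classic (alpha, beta, gamma) model.

    Illustrative assumptions (not from the paper):
      n     -- number of participating nodes
      size  -- message size per node, in bytes
      alpha -- per-message latency, in seconds
      beta  -- transfer time per byte, in seconds
      gamma -- reduction (compute) time per byte, in seconds
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return 0.0  # a single node has nothing to communicate

    steps = n - 1      # each phase of the ring takes n - 1 steps
    chunk = size / n   # each step moves one chunk of size / n bytes

    # Reduce-scatter: every step sends one chunk and reduces it on arrival.
    reduce_scatter = steps * (alpha + chunk * beta + chunk * gamma)
    # All-gather: every step only sends one chunk; no reduction is needed.
    all_gather = steps * (alpha + chunk * beta)
    return reduce_scatter + all_gather


if __name__ == "__main__":
    # Hypothetical parameters: 64 nodes, 256 MiB message,
    # 5 us latency, 10 GB/s links, 100 GB/s reduction throughput.
    t = ring_allreduce_cost(64, 256 * 2**20, 5e-6, 1 / 10e9, 1 / 100e9)
    print(f"estimated ring AllReduce time: {t * 1e3:.2f} ms")
```

A cost function of this shape assumes that every link always delivers its full bandwidth and that reading and writing memory is free, which are the two assumptions the incast and memory access terms are meant to relax.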

Revisiting the Time Cost Model of AllReduce
