Abstract: The \texttt{MPI\_Allreduce} collective operation is a core kernel of many parallel codebases, particularly for reductions over a single value per process. The commonly used allreduce recursive-doubling algorithm obtains the lower bound message count, yielding optimality for small reduction sizes based on node-agnostic performance models. However, this algorithm yields duplicate messages between sets of nodes. Node-aware optimizations in MPICH remove duplicate messages through use of a single master process per node, yielding a large number of inactive processes at each inter-node step. In this paper, we present an algorithm that uses the multiple processes available per node to reduce the maximum number of inter-node messages communicated by a single process, improving the performance of allreduce operations, particularly for small message sizes.

What problem does this paper attempt to address?

This paper addresses the poor performance of existing MPI_Allreduce algorithms for small message sizes, especially in large-scale parallel computing environments. The traditional recursive-doubling algorithm attains the lower bound on the number of messages, but it sends duplicate messages between nodes, which adds unnecessary inter-node communication overhead. The existing node-aware optimization removes these duplicates by using a single master process on each node, but this leaves a large number of processes idle during each inter-node communication step and causes load imbalance.

To solve these problems, the paper proposes a new Node-Aware Parallel (NAP) allreduce algorithm. It uses the multiple processes on each node to reduce the maximum number of inter-node messages that any single process must communicate. By adding intra-node communication and local computation steps, the NAP algorithm reduces the number of inter-node communication steps from \(\log_2(n)\) to \(\log_{ppn}(n)\), where \(n\) is the number of nodes involved and \(ppn\) is the number of processes on each node.

Performance models and experimental results both show that, for small message sizes, the NAP algorithm significantly outperforms the traditional recursive-doubling method and the existing node-aware SMP method, and that the advantage grows as the number of processes increases. For larger message sizes, however, the SMP method still outperforms NAP.

Therefore, the NAP algorithm is particularly suited to applications that perform allreduce operations on small messages in large-scale parallel environments.
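To make the step counts concrete, the following Python sketch compares \(\log_2(n)\) with \(\log_{ppn}(n)\) and simulates a simplified NAP-style sum allreduce at node granularity. It is an illustration under stated assumptions, not the paper's implementation: the function names `ceil_log` and `nap_allreduce_sim` are hypothetical, the number of nodes is assumed to be an exact power of \(ppn\), each node is assumed to already hold one partial value from an initial intra-node reduction, and the exchange pattern (local process \(j\) fetches the value of the \(j\)-th node in its group) is one plausible way to realize the idea.

```python
def ceil_log(n, base):
    """Return the smallest k with base**k >= n, using integer arithmetic."""
    if n < 1 or base < 2:
        raise ValueError("need n >= 1 and base >= 2")
    steps, reach = 0, 1
    while reach < n:
        reach *= base
        steps += 1
    return steps


def nap_allreduce_sim(node_values, ppn):
    """Simulate a simplified NAP-style sum allreduce, one value per node.

    Assumption: len(node_values) is an exact power of ppn.
    Returns (final per-node values, number of inter-node steps).
    """
    n = len(node_values)
    if ppn ** ceil_log(n, ppn) != n:
        raise ValueError("number of nodes must be a power of ppn")
    values = list(node_values)
    stride, steps = 1, 0
    while stride < n:
        new_values = []
        for node in range(n):
            digit = (node // stride) % ppn   # position of this node in its group
            base = node - digit * stride     # first node of the group
            # Inter-node step: local process j obtains the partial value held
            # by node base + j * stride (process `digit` keeps its own value),
            # so each process handles at most one inter-node message.
            received = [values[base + j * stride] for j in range(ppn)]
            # Intra-node step: the ppn received values are reduced locally.
            new_values.append(sum(received))
        values = new_values
        stride *= ppn
        steps += 1
    return values, steps


if __name__ == "__main__":
    n, ppn = 16, 4
    print("recursive doubling inter-node steps:", ceil_log(n, 2))
    print("NAP inter-node steps:", ceil_log(n, ppn))
    print(nap_allreduce_sim(list(range(n)), ppn))
```

With 16 nodes and 4 processes per node, the sketch takes two inter-node steps instead of the four that recursive doubling needs across nodes, at the cost of an extra intra-node reduction in every step.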

Node-Aware Improvements to Allreduce
