Node-Aware Improvements to Allreduce

Amanda Bienz, Luke N. Olson, William D. Gropp
DOI: https://doi.org/10.48550/arXiv.1910.09650
2019-10-22
Abstract: The MPI_Allreduce collective operation is a core kernel of many parallel codebases, particularly for reductions over a single value per process. The commonly used allreduce recursive-doubling algorithm obtains the lower bound message count, yielding optimality for small reduction sizes based on node-agnostic performance models. However, this algorithm yields duplicate messages between sets of nodes. Node-aware optimizations in MPICH remove duplicate messages through use of a single master process per node, yielding a large number of inactive processes at each inter-node step. In this paper, we present an algorithm that uses the multiple processes available per node to reduce the maximum number of inter-node messages communicated by a single process, improving the performance of allreduce operations, particularly for small message sizes.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the poor performance of existing MPI_Allreduce algorithms for small message sizes, especially in large-scale parallel computing environments. The traditional recursive-doubling algorithm reaches the lower bound on the number of messages, but it sends duplicate messages between the same sets of nodes, which adds unnecessary inter-node communication overhead. The existing node-aware optimization, referred to here as the SMP method, removes these duplicates by using a single master process on each node, but it leaves a large number of processes idle during every inter-node communication step, causing load imbalance.

To solve these problems, the paper proposes a Node-Aware Parallel (NAP) allreduce algorithm. It uses the multiple processes available on each node to reduce the maximum number of inter-node messages that any single process must communicate. By adding intra-node communication and local computation steps, the NAP algorithm reduces the number of inter-node communication steps from \(\log_2(n)\) to \(\log_{ppn}(n)\), where \(n\) is the number of nodes involved and \(ppn\) is the number of processes per node.

Performance models and experimental results show that, for small message sizes, the NAP algorithm significantly outperforms both traditional recursive doubling and the existing node-aware SMP method, particularly as the number of processes increases. For larger message sizes, however, the SMP method still outperforms NAP.
The NAP algorithm is therefore best suited to applications that perform small-message reductions in large-scale parallel environments.
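To make the reduction from \(\log_2(n)\) to \(\log_{ppn}(n)\) inter-node steps concrete, the following Python sketch compares the two step counts and simulates plain recursive doubling. It is only an illustration of the counting argument, not the paper's implementation: the function names are hypothetical, ranks are assumed to be laid out node by node (rank // ppn gives the node), the total process count is assumed to be a power of two, and no actual MPI communication or timing is modeled.

```python
def ceil_log(value: int, base: int) -> int:
    """Smallest k with base**k >= value, using integer arithmetic only."""
    if value < 1 or base < 2:
        raise ValueError("need value >= 1 and base >= 2")
    k, reach = 0, 1
    while reach < value:
        reach *= base
        k += 1
    return k


def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce with recursive doubling.

    At the step with distance d, rank r combines its value with rank r XOR d.
    Assumes len(values) is a power of two.
    """
    p = len(values)
    if p < 1 or p & (p - 1):
        raise ValueError("process count must be a power of two")
    vals = list(values)
    dist = 1
    while dist < p:
        vals = [vals[r] + vals[r ^ dist] for r in range(p)]
        dist *= 2
    return vals


def recursive_doubling_internode_steps(n_nodes: int, ppn: int) -> int:
    """Maximum number of recursive-doubling steps in which any rank's
    partner lives on a different node.

    Assumption: ranks are assigned to nodes contiguously (node = rank // ppn).
    """
    p = n_nodes * ppn
    if p < 1 or p & (p - 1):
        raise ValueError("process count must be a power of two")
    worst = 0
    for rank in range(p):
        steps = 0
        dist = 1
        while dist < p:
            if (rank ^ dist) // ppn != rank // ppn:
                steps += 1
            dist *= 2
        worst = max(worst, steps)
    return worst


if __name__ == "__main__":
    for n_nodes, ppn in [(16, 16), (256, 16)]:
        rd = recursive_doubling_internode_steps(n_nodes, ppn)
        nap = ceil_log(n_nodes, ppn)  # inter-node step count stated for NAP
        print(f"n={n_nodes}, ppn={ppn}: recursive doubling {rd}, NAP {nap}")
```

Under these assumptions, 256 nodes with 16 processes each give 8 inter-node steps for recursive doubling and 2 for the \(\log_{ppn}(n)\) count. The sketch counts steps only; it does not account for the extra intra-node communication and local computation that NAP adds, which is part of why the benefit is limited to small messages.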