Configurable Non-uniform All-to-all Algorithms

Ke Fan, Jens Domke, Seydou Ba, Sidharth Kumar
2024-11-05
Abstract: MPI_Alltoallv generalizes the uniform all-to-all communication (MPI_Alltoall) by enabling the exchange of data blocks of varied sizes among processes. This function plays a crucial role in many applications, such as FFT computation and relational algebra operations. Popular MPI libraries, such as MPICH and OpenMPI, implement MPI_Alltoall using a combination of linear and logarithmic algorithms. However, MPI_Alltoallv typically relies only on variations of linear algorithms, missing the benefits of logarithmic approaches. Furthermore, current algorithms also overlook the intricacies of modern HPC system architectures, such as the significant performance gap between intra-node (local) and inter-node (global) communication. This paper introduces a set of Tunable Non-uniform All-to-all algorithms, denoted TuNA\(_{l}^{g}\), where g and l refer to global (inter-node) and local (intra-node) communication hierarchies. These algorithms consider key factors such as the hierarchical architecture of HPC systems, network congestion, the number of data exchange rounds, and the communication burst size. The algorithm efficiently addresses the trade-off between bandwidth maximization and latency minimization that existing implementations struggle to optimize. We show a performance improvement over the state-of-the-art implementations by factors of 42x and 138x on Polaris and Fugaku, respectively.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper attempts to solve the efficiency problem of non-uniform all-to-all communication in high-performance computing (HPC) systems. Existing implementations usually rely on linear-time algorithms, forgoing the advantages of logarithmic-time algorithms and failing to exploit the hierarchical architecture of modern HPC systems, in particular the large gap in communication performance between intra-node and inter-node links. As a result, they struggle to balance maximizing bandwidth against minimizing latency.

To address these problems, the authors propose a tunable-radix non-uniform all-to-all algorithm, TuNA (Tunable Non-uniform All-to-all), and its hierarchical variant TuNA\(_{l}^{g}\). The main innovations of these algorithms include:

1. **Tunable Radix**: By adjusting the radix \(r\) (from 2 to the total number of processes \(P\)), the algorithm trades off the number of communication rounds against the amount of data exchanged per round, thereby optimizing performance.
2. **Two-stage Communication Scheme**: Each communication round is divided into two stages, a metadata exchange followed by the actual data exchange, to accommodate non-uniform data distributions.
3. **Temporary Buffer Optimization**: Theoretical analysis determines the optimal size of the temporary buffer \(T\), reducing memory overhead and improving efficiency.
4. **Hierarchical Design**: To match the multi-level structure of modern HPC systems, communication is divided into intra-node and inter-node parts, further improving communication efficiency.

With these improvements, the paper reports significant performance gains for TuNA and its variants over state-of-the-art implementations: factors of 42x on the Polaris supercomputer and 138x on Fugaku.
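The tunable-radix and two-stage ideas can be illustrated with a small single-process simulation. This is a minimal sketch, not the authors' implementation: it assumes a Bruck-style store-and-forward schedule in which, in round \((x, z)\), each process forwards every block whose remaining distance has base-\(r\) digit \(z\) at position \(x\) to the process \(z \cdot r^x\) ranks ahead. The function names, the `(src, dst, payload)` block representation, and the way the metadata stage is modeled as a list of sizes are all hypothetical choices made for illustration.

```python
def digit(value, x, r):
    """Return the base-r digit of `value` at position x."""
    return (value // r**x) % r


def simulate_tunable_radix_alltoallv(send, r):
    """Simulate a radix-r store-and-forward non-uniform all-to-all.

    send[p][d] is the bytes payload that process p sends to process d
    (payload sizes may differ). Returns (recv, rounds), where recv[d][p]
    is the payload that d received from p.
    Assumption: a Bruck-style schedule, not the paper's exact algorithm.
    """
    P = len(send)
    if not 2 <= r <= P:
        raise ValueError("radix must satisfy 2 <= r <= P")

    # held[p] lists the (src, dst, payload) blocks currently stored at p.
    held = [[(p, d, send[p][d]) for d in range(P)] for p in range(P)]
    rounds = 0
    x = 0
    while r**x < P:
        for z in range(1, r):
            step = z * r**x
            if step >= P:
                break  # no block can have this digit set
            # Pick outgoing blocks for every process before delivering any,
            # so that all processes act on the same pre-round state.
            outgoing = []
            for p in range(P):
                go, stay = [], []
                for block in held[p]:
                    distance = (block[1] - p) % P
                    (go if digit(distance, x, r) == z else stay).append(block)
                held[p] = stay
                # Stage 1 (metadata): the sizes of the blocks about to be sent.
                sizes = [len(payload) for _, _, payload in go]
                outgoing.append((sizes, go))
            for p in range(P):
                sizes, blocks = outgoing[p]
                receiver = (p + step) % P
                # Stage 2 (data): payloads must match the announced sizes.
                assert sizes == [len(payload) for _, _, payload in blocks]
                held[receiver].extend(blocks)
            rounds += 1
        x += 1

    recv = [[None] * P for _ in range(P)]
    for p in range(P):
        for src, dst, payload in held[p]:
            assert dst == p  # every block has reached its destination
            recv[p][src] = payload
    return recv, rounds
```

Calling `simulate_tunable_radix_alltoallv(send, r)` with different radices on the same `send` matrix returns the same `recv` but a different round count, which is the trade-off the tunable radix exposes: a small radix gives few rounds with blocks forwarded several times, while \(r = P\) sends each block directly to its destination in \(P - 1\) rounds.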
### Formula Summary

- **Maximum Number of Communication Rounds**: \(K \leq w \cdot (r - 1)\), where \(w = \lceil \log_r P \rceil\).
- **Number of Data Blocks Sent per Round**: Each process can send at most \(r^{w - 1}\) data blocks per round.
- **Temporary Buffer Size**: \(B = P - (K + 1)\), used to store intermediate data blocks.
- **Intra-node Communication Rounds**: Using the TuNA algorithm, the radix \(r \in [2, Q]\), where \(Q\) is the number of processes within each node.
- **Inter-node Communication Rounds**: A scattered algorithm is adopted, and the network load is optimized by adjusting the block count parameter.

Together, these formulas and methods enable TuNA and its variants to achieve efficient non-uniform all-to-all communication in HPC systems of different scales and architectures.
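To make the quantities in the formula summary concrete, the following sketch evaluates them for a given process count \(P\) and radix \(r\). The bound \(w \cdot (r - 1)\), the per-round block limit \(r^{w-1}\), and \(B = P - (K + 1)\) come from the summary. The exact round count \(K\) is an assumption here: it is computed as the number of pairs \((x, z)\) with \(1 \leq z < r\) and \(z \cdot r^x < P\), which is how a Bruck-style radix-\(r\) schedule would count its rounds. The function and key names are hypothetical.

```python
def ceil_log(P, r):
    """Smallest integer w with r**w >= P, i.e. ceil(log_r P), in exact integers."""
    w = 0
    while r**w < P:
        w += 1
    return w


def tuna_parameters(P, r):
    """Evaluate the summary formulas for P processes and radix r."""
    if P < 2 or not 2 <= r <= P:
        raise ValueError("need P >= 2 and 2 <= r <= P")
    w = ceil_log(P, r)
    round_bound = w * (r - 1)  # K <= w * (r - 1)
    # Assumption: one round per (digit position x, digit value z) whose
    # step z * r**x is smaller than P.
    K = sum(1 for x in range(w) for z in range(1, r) if z * r**x < P)
    return {
        "w": w,
        "round_bound": round_bound,
        "K": K,
        "max_blocks_per_round": r ** (w - 1),
        "temp_buffer_blocks": P - (K + 1),  # B = P - (K + 1)
    }
```

For example, `tuna_parameters(8, 2)` gives \(w = 3\), \(K = 3\), and \(B = 4\), while `tuna_parameters(8, 8)` gives \(K = 7\) and \(B = 0\): under the stated assumption, the largest radix needs the most rounds but no temporary buffer, because every block goes straight to its destination.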