Abstract: Message Passing Interface (MPI) is the de facto standard for parallel programming, and MPI collective operations are widely used in scientific applications. The efficiency of these collective operations strongly affects the performance of parallel applications. As HPC systems grow in scale and heterogeneity, the network environment has become more complex. Network states vary widely and dynamically between node pairs, which makes it harder to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations using network states measured in real time, focusing specifically on the binomial tree algorithm. Our approach measures the network states with a low-overhead method and constructs a low-latency binomial tree from the measurement results. Additionally, we take into account the differences between the two underlying MPI peer-to-peer communication protocols, eager and rendezvous, and design a tailored binomial tree construction algorithm for each protocol. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather and MPI_Scatter that use our network states-aware binomial tree algorithm at the inter-node level. Benchmark results show that our algorithm improves performance for small and medium messages compared with the default binomial tree algorithm in Open MPI. Specifically, for MPI_Bcast, we observe an average performance improvement of over 15.5% when the message size is less than 64 KB. Similarly, for MPI_Reduce, there is an average performance improvement of over 12.1% when the message size is below 2 KB. For MPI_Gather, there is an average performance improvement of over 10% when the message size ranges from 64 B to 512 B. For MPI_Scatter, our algorithm achieves a performance improvement only for certain message sizes.
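To make the idea of a latency-aware binomial tree concrete, the following is a minimal Python sketch and not the algorithm proposed in the paper. It assumes a hypothetical pairwise latency matrix `lat` (standing in for measured network states), a simplified cost model in which each send occupies the sender for the full pairwise latency, and a simple greedy heuristic for placing nodes in the tree. It does not model the eager and rendezvous protocols separately. All function names are illustrative.

```python
def binomial_rounds(n):
    """Yield, per round, the (sender_pos, receiver_pos) pairs of a binomial broadcast."""
    step = 1
    while step < n:
        # In this round, positions 0..step-1 already hold the data.
        yield [(p, p + step) for p in range(step) if p + step < n]
        step *= 2


def completion_time(order, lat):
    """Broadcast completion time when tree position p is occupied by node order[p].

    Assumption: a send from a to b costs lat[a][b], and a sender's
    successive sends are serialized.
    """
    n = len(order)
    ready = [0.0] * n  # time at which each position can start its next send
    for pairs in binomial_rounds(n):
        for s, r in pairs:
            arrival = ready[s] + lat[order[s]][order[r]]
            ready[s] = arrival
            ready[r] = arrival
    return max(ready)


def greedy_order(lat, root=0):
    """Assumed heuristic: each sender picks its cheapest not-yet-reached node."""
    n = len(lat)
    order = [root]
    free = set(range(n)) - {root}
    for pairs in binomial_rounds(n):
        for s, _ in pairs:
            # Ties are broken by the lower node id.
            nxt = min(free, key=lambda v: (lat[order[s]][v], v))
            free.remove(nxt)
            order.append(nxt)
    return order


# Hypothetical symmetric latencies between four nodes (arbitrary units).
lat = [
    [0, 5, 1, 1],
    [5, 0, 1, 5],
    [1, 1, 0, 5],
    [1, 5, 5, 0],
]

default_time = completion_time([0, 1, 2, 3], lat)        # rank-order tree
aware_order = greedy_order(lat)                          # latency-aware placement
aware_time = completion_time(aware_order, lat)
```

On this toy matrix the rank-order tree completes in 10 time units, while the greedy placement `[0, 2, 3, 1]` completes in 2, because the slow 0-to-1 and 1-to-3 links are avoided. A greedy choice like this is not guaranteed to be optimal in general; it only illustrates why the mapping of nodes to tree positions matters when pairwise latencies differ.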