Abstract: Message Passing Interface (MPI) is the de facto standard for parallel programming, and MPI collective operations are widely used in scientific applications. The efficiency of these collective operations strongly affects the performance of parallel applications. As HPC systems grow in scale and heterogeneity, the network environment has become more complex: network states vary widely and dynamically between node pairs, which makes it harder to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations using network states measured in real time, focusing on the binomial tree algorithm. Our approach measures the network states with a low-overhead method and constructs a low-latency binomial tree from the measurement results. We also account for the differences between the two underlying MPI point-to-point communication protocols, eager and rendezvous, and design a tailored binomial tree construction algorithm for each protocol. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather and MPI_Scatter, using our network states-aware binomial tree algorithm at the inter-node level. Benchmark results show that our algorithm improves performance for small and medium messages compared with the default binomial tree algorithm in Open MPI. For MPI_Bcast, we observe an average performance improvement of over 15.5% when the message size is less than 64 KB. For MPI_Reduce, the average improvement is over 12.1% when the message size is below 2 KB. For MPI_Gather, the average improvement is over 10% when the message size ranges from 64 B to 512 B. For MPI_Scatter, our algorithm improves performance only for certain message sizes.
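
To illustrate why per-pair network states matter for a binomial tree, the following is a minimal sketch and not the paper's measurement method or construction algorithm. It builds the standard binomial broadcast tree over virtual ranks and estimates the finish time under a toy cost model in which a sender forwards to its children one after another and each send costs the pairwise latency. The names `binomial_children`, `broadcast_finish_time`, `latency` and `placement` are hypothetical, and the latency values are made up for the example.

```python
def binomial_children(vrank, size):
    """Children of virtual rank `vrank` in a binomial broadcast tree rooted at 0.

    Children are listed in the order the parent sends to them
    (largest subtree first), as in the usual binomial broadcast.
    """
    mask = 1
    # Stop at the lowest set bit of vrank; the root (0) runs past `size`.
    while mask < size and not (vrank & mask):
        mask <<= 1
    children = []
    mask >>= 1
    while mask > 0:
        if vrank + mask < size:
            children.append(vrank + mask)
        mask >>= 1
    return children


def broadcast_finish_time(latency, placement):
    """Estimated time until the last node receives the message.

    latency[a][b] is an assumed cost of one send from node a to node b.
    placement[v] is the physical node placed at virtual rank v.
    Toy model: a sender issues its sends sequentially.
    """
    size = len(placement)
    arrival = {0: 0.0}
    stack = [0]
    while stack:
        v = stack.pop()
        t = arrival[v]
        for c in binomial_children(v, size):
            t += latency[placement[v]][placement[c]]
            arrival[c] = t
            stack.append(c)
    return max(arrival.values())


# Four nodes; the link between node 0 and node 2 is assumed to be slow.
latency = [
    [0, 1, 5, 1],
    [1, 0, 1, 1],
    [5, 1, 0, 1],
    [1, 1, 1, 0],
]
default_time = broadcast_finish_time(latency, [0, 1, 2, 3])
aware_time = broadcast_finish_time(latency, [0, 3, 1, 2])
print(default_time, aware_time)
```

In this toy setup, the default placement sends over the slow 0-to-2 link first, while the second placement keeps that link out of the tree, so the estimated finish time drops. A real implementation would also have to distinguish the eager and rendezvous protocols, which this sketch ignores.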

Network states-aware collective communication optimization
