Abstract: Message Passing Interface (MPI) is the de facto standard for parallel programming, and MPI collective operations are widely used by scientific applications. The efficiency of these collective operations strongly affects the performance of parallel applications. As HPC systems grow in scale and heterogeneity, the network environment has become more complex. Network states vary widely and dynamically between node pairs, which makes it harder to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations using network states measured in real time, focusing specifically on the binomial tree algorithm. Our approach measures the network states with a low-overhead method and constructs a low-latency binomial tree from the measurement results. We also account for the differences between the two underlying MPI point-to-point communication protocols, eager and rendezvous, and design a tailored binomial tree construction algorithm for each protocol. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather and MPI_Scatter that use our network states-aware binomial tree algorithm at the inter-node level. Benchmark results show that our algorithm improves performance for small and medium messages compared to the default binomial tree algorithm in Open MPI. Specifically, for MPI_Bcast, we observe an average performance improvement of over 15.5% when the message size is less than 64KB. Similarly, for MPI_Reduce, there is an average performance improvement of over 12.1% when the message size is below 2KB. For MPI_Gather, there is an average performance improvement of over 10% when the message size ranges from 64B to 512B. For MPI_Scatter, our algorithm achieves a performance improvement only for certain message sizes.
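The abstract does not spell out how the tree is constructed, so the following is only a minimal sketch of the general idea, under stated assumptions: a pairwise latency matrix has already been measured, and a simple greedy rule (each informed node picks its lowest-latency uninformed peer in every round) stands in for the paper's actual construction algorithms. The function name `build_latency_aware_tree` and the `latency` input are hypothetical, and the eager/rendezvous distinction is not modeled.

```python
def build_latency_aware_tree(latency, root=0):
    """Greedy sketch of a binomial-style broadcast tree (illustrative only).

    latency[i][j] is an assumed, already-measured latency from node i to
    node j. In each round, every node that already holds the message sends
    to one node that does not, so the informed set at most doubles per
    round, as in a binomial tree. The receiver is chosen greedily as the
    lowest-latency uninformed peer of the sender.
    Returns (parent, rounds): parent maps each node to its sender.
    """
    n = len(latency)
    parent = {root: None}   # the root has no parent
    informed = [root]       # nodes holding the message, in arrival order
    rounds = 0
    while len(informed) < n:
        newly_informed = []
        for sender in informed:
            # Nodes not yet assigned a parent are still waiting for data.
            candidates = [v for v in range(n) if v not in parent]
            if not candidates:
                break
            receiver = min(candidates, key=lambda v: latency[sender][v])
            parent[receiver] = sender
            newly_informed.append(receiver)
        informed.extend(newly_informed)
        rounds += 1
    return parent, rounds


if __name__ == "__main__":
    # Hypothetical symmetric latency matrix for four nodes (arbitrary units).
    example = [
        [0, 5, 1, 9],
        [5, 0, 4, 2],
        [1, 4, 0, 8],
        [9, 2, 8, 0],
    ]
    tree, depth = build_latency_aware_tree(example, root=0)
    print(tree, depth)
```

With uniform latencies this reduces to an ordinary doubling schedule that finishes in ceil(log2(n)) rounds; with non-uniform latencies the greedy choice only changes which peer each sender picks, not the number of rounds.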

Network states-aware collective communication optimization
