Abstract: The International Journal of High Performance Computing Applications, Ahead of Print. As HPC system architectures and the applications running on them continue to evolve, the MPI standard itself must evolve. The trend in current and future HPC systems toward powerful nodes with multiple CPU cores and multiple GPU accelerators makes efficient support for hybrid programming critical for applications to achieve high performance. However, the support for hybrid programming in the MPI standard has not kept up with recent trends. The MPICH implementation of MPI provides a platform for implementing and experimenting with new proposals and extensions to fill this gap and to gain valuable experience and feedback before the MPI Forum can consider them for standardization. In this work, we detail six extensions implemented in MPICH to increase MPI interoperability with other runtimes, with a specific focus on heterogeneous architectures. First, the extension to MPI generalized requests lets applications integrate asynchronous tasks into MPI's progress engine. Second, the iovec extension to datatypes lets applications use MPI datatypes as a general-purpose data layout API beyond just MPI communications. Third, a new MPI object, MPIX_Stream, can be used by applications to identify execution contexts beyond MPI processes, including threads and GPU streams. MPIX stream communicators can be created to make existing MPI functions thread-aware and GPU-aware, thus providing applications with explicit ways to achieve higher performance. Fourth, MPIX Streams are extended to support the enqueue semantics for offloading MPI communications onto a GPU stream context. Fifth, thread communicators allow MPI communicators to be constructed with individual threads, thus providing a new level of interoperability between MPI and on-node runtimes such as OpenMP. Lastly, we present an extension to invoke MPI progress, which lets users spawn progress threads with fine-grained control to adapt the communication performance to their application designs. We describe the design and implementation of these extensions, provide usage examples, and highlight their expected benefits with performance results.
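To make the second extension more concrete, the sketch below illustrates what an iovec view of a datatype conveys: a strided layout, described the way `MPI_Type_vector` describes one (count, block length, and stride in units of the element type), flattened into contiguous (offset, length) byte segments. This is not the MPICH interface and does not call MPI at all; the names `vector_to_iovecs` and `gather` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One contiguous piece of memory: (byte offset, byte length).
Segment = Tuple[int, int]


def vector_to_iovecs(count: int, blocklength: int, stride: int,
                     elem_size: int) -> List[Segment]:
    """Flatten a vector-style strided layout into contiguous byte segments.

    Hypothetical helper: count blocks, each of blocklength elements,
    with block starts separated by stride elements of elem_size bytes.
    Adjacent segments are merged into one.
    """
    if count < 0 or blocklength < 0 or elem_size <= 0:
        raise ValueError("count/blocklength must be >= 0 and elem_size > 0")

    segments: List[Segment] = []
    length = blocklength * elem_size
    for i in range(count):
        if length == 0:
            continue  # empty blocks contribute no segments
        offset = i * stride * elem_size
        if segments and segments[-1][0] + segments[-1][1] == offset:
            # This block starts exactly where the previous one ends: merge.
            prev_offset, prev_length = segments.pop()
            segments.append((prev_offset, prev_length + length))
        else:
            segments.append((offset, length))
    return segments


def gather(buf: bytes, segments: List[Segment]) -> bytes:
    """Pack the bytes selected by the segments into one contiguous buffer."""
    return b"".join(buf[off:off + length] for off, length in segments)
```

Once a layout is available as a plain segment list like this, code outside the communication library (for example, an I/O routine or a GPU copy loop) can walk the same description without knowing how the datatype was constructed, which is the kind of reuse the abstract describes.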

Extending $\tau$-Lop to Model MPI Blocking Primitives on Shared Memory

C-Lop: Accurate Contention-Based Modeling of MPI Concurrent Communication

Hybrid Performance Modeling And Analyzing Of Parallel Systems

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

A Numerical Model Oriented Large-scale Parallel I/O Optimization Method

A Conservative Time Management Model with Optimized Degree of Parallelism for Distributed Simulation

LogGPO: an Accurate Communication Model for Performance Prediction of MPI Programs

OpenMP Compiler for Distributed Memory Architectures

mPlogP: A Parallel Computation Model for Heterogeneous Multi-core Computer

LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming

Performance Modeling for MPI Applications with Low Overhead Fine-Grained Profiling

MPI Progress For All

A time management optimization framework for large-scale distributed hardware-in-the-loop simulation

Designing and prototyping extensions to the Message Passing Interface in MPICH

A Communication- and Memory-Aware Model for Load Balancing Tasks

HmPlogP: a hierarchical computation model for heterogeneous multi-core parallel systems

LogSC: Model-based One-Sided Communication Performance Estimation

Extended Overhead Analysis For Openmp Performance Tuning

An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression

The extension of OpenMP parallel programming model to support transactional memory execution