Abstract: High-performance computing (HPC) provides the computing power behind many of the world's most important discoveries in scientific and business domains such as chemistry, physics, biology, materials science, drug discovery, and financial investment risk analysis. We are now in the exascale era: the Frontier exascale system was unveiled in June 2022. Researchers across the HPC community have been developing software systems, tools, libraries, frameworks, application packages, and methods that can fully exploit these extremely powerful computing resources. Such extreme-scale computing will enable the solution of vastly more accurate predictive models and the analysis of massive quantities of data, producing quantum advances in areas of science and technology that are essential to the scientific community.

New computational approaches such as machine learning/deep learning have also been heavily explored in recent years and have shown promising results on many problems that cannot be resolved by traditional computational simulation and engineering. Training deep neural networks on massive data is an extremely compute-intensive task that relies heavily on HPC power. The exascale computing era will be the essential basis for supporting new innovations in machine learning/deep learning-based exploration and will lead to new science in directions such as smart manufacturing, laboratory automation, and automatic programming.

While the hardware architecture can generate extreme computing power, renovation of the software stack plays an essential role in delivering that performance effectively. Large supercomputers continue to move into the heterogeneous space, and systems such as the ARM-based Fugaku, formerly the fastest system, and the many-core Sunway TaihuLight are marching in the same direction.
With systems equipped with GPUs, Advanced RISC Machine (ARM) Scalable Vector Extension (SVE) units, and many-core processors, there is a dire need for innovative software frameworks that can seamlessly migrate scientific code to these resource-rich systems. Innovation is needed at several levels, including compiler tools and techniques, performance analysis tools, novel programming-model abstractions, and the redesign of application-level algorithms. Furthermore, co-design of applications and low-level software frameworks can lead to more efficient use of the opportunities of exascale in many contexts.

This special issue has selected 14 papers. We next summarize the papers presented in this special issue.

The first paper, titled “ParTransgrid: A scalable parallel pre-processing tool for unstructured-grid cell-centered Computational Fluid Dynamics (CFD) applications” by Jianqiang et al.¹, proposes a parallel pre-processing tool, called ParTransgrid, that translates general grid formats such as the CFD General Notation System into an efficient distributed mesh data format for large-scale parallel computing. Experimental results show that ParTransgrid scales easily to CFD applications with billion-level grids and that the preparation time for parallel computing on hundreds of thousands of cores is reduced to a few minutes.

The second paper, titled “Generation of logic designs for efficiently solving ordinary differential equations on FPGAs” by Korch et al.², proposes a framework that automatically generates specific, optimized solver logic from easy-to-handle configuration files. No manual development and no special Field Programmable Gate Array (FPGA) or programming knowledge is required. Depending on the solution method, the logic generated by this improved approach is up to 43 times faster than its hand-optimized High Level Synthesis (HLS) counterpart.
The third paper, titled “NAS Parallel Benchmarks with Compute Unified Device Architecture (CUDA) and Beyond” by Fernandes et al.³, provides a new CUDA implementation of the NAS Parallel Benchmarks (NPB). The performance results show improvements of up to 267% over the best benchmark versions available. The authors also identify the best and worst design choices with respect to the tradeoff between code size and performance. Lastly, the authors highlight the challenges of implementing parallel CFD applications for Graphics Processing Units (GPUs) and how the computations affect the GPU's behavior.

The fourth paper, titled “Using Ginkgo's Memory Accessor for Improving the Accuracy of Memory-Bound Low Precision Basic Linear Algebra Subprograms (BLAS)” by Quintana-Ortí et al.⁴, demonstrates that memory-bound applications operating on low-precision data can increase their accuracy by relying on the memory accessor to perform all arithmetic operations in high precision. In particular, the authors demonstrate that memory-bound BLAS operations (including the sparse matrix-vector product) can be re-engineered with the memory accessor and that the resulting accessor-enabled BLAS routines achieve lower rounding errors while delivering the same performance as the fast low-precision BLAS.

The fifth paper, titled “Three Practical Workflow Schedulers for Easy Maximum Parallelism” by Rogers⁵, presents a complete characterization of the minimum effective task granularity for efficient scheduler usage scenarios. A separate job scheduler is implemented for each of three distinct workflow patterns involved in the preparation, execution, and analysis of computational chemistry simulations. The schedulers offer distinct benefits, including simplicity of design, suitability for HPC centers, short startup time, and well-understood per-task overhead. All three new tools have been shown to scale to full utilization of Summit and have been made publicly available with tests and documentation.
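The accessor idea summarized for the fourth paper (keep data in a compact low-precision format in memory, but widen it on load and do the arithmetic in high precision) can be illustrated with a small sketch. This is not Ginkgo's API and not the authors' implementation: the function names `to_f32`, `dot_low_precision`, and `dot_accessor_style` are hypothetical, and float32 storage is emulated here with Python's standard `array` and `struct` modules, assuming IEEE 754 single and double precision.

```python
import math
import struct
from array import array


def to_f32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_low_precision(x: array, y: array) -> float:
    """Dot product in which every intermediate result is rounded to float32."""
    acc = 0.0
    for a, b in zip(x, y):
        acc = to_f32(acc + to_f32(a * b))
    return acc


def dot_accessor_style(x: array, y: array) -> float:
    """Same float32 storage, but values are widened on load and all
    arithmetic is carried out in float64 (the accessor-style variant)."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b  # a and b come out of float32 storage as doubles
    return acc


n = 100_000
x = array("f", [0.1] * n)  # typecode 'f': 4-byte float storage
y = array("f", [1.0] * n)

# Reference: correctly rounded sum of the products of the stored values.
reference = math.fsum(a * b for a, b in zip(x, y))

err_low = abs(dot_low_precision(x, y) - reference)
err_acc = abs(dot_accessor_style(x, y) - reference)

print(f"float32 arithmetic error:        {err_low:.3e}")
print(f"float64 arithmetic error (same storage): {err_acc:.3e}")
```

Both variants read exactly the same 4-byte values, so a memory-bound kernel would move the same amount of data in either case; only the precision of the arithmetic differs, which is the property the paper's summary attributes to the accessor-enabled BLAS routines.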
The sixth paper, titled “LLAMA: The Low-Level Abstraction for Memory Access” by Bussmann et al.⁶, presents the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++-compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. LLAMA thus provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.

The seventh paper, titled “PAS: A new powerful and simple quantum computing simulator” by Wang et al.⁷, proposes a new powerful and simple CPU-based quantum computing simulator, PAS (Power And Simple). Compared with existing simulators, PAS introduces four novel optimization methods: efficient hybrid vectorization, fast bitwise operation, memory access filtering, and quantum tracking. Experiments on an Intel Xeon E5-2670 v3 CPU show that, compared with the state-of-the-art simulator QuEST, PAS achieves mean speedups of 8.69x and 2.62x on the Quantum Fourier Transform (QFT) and Random Quantum Circuit (RQC) benchmarks, respectively.

The eighth paper, titled “Dynamics Signature based Anomaly Detection” by Bader et al.⁸, borrows dynamics metrics and proposes the concept of a Dynamics Signature (DS) in a multi-dimensional feature space to efficiently distinguish abnormal events from the normal behavior of a variable star. Two datasets, a parameterized sinusoidal dataset containing 262,440 light curves and a real variable-star dataset containing 462,996 light curves, are used to evaluate the practical performance of the proposed DS algorithm.
Experimental results show that the DS algorithm is highly accurate, sensitive enough to detect weak microlensing events at very early stages, and fast enough to process 176,000 stars in less than 1 second on a commodity computer.

The ninth paper, titled “EESSI: A Cross-Platform Ready-To-Use Optimised Scientific Software Stack” by Röblitz et al.⁹, presents the European Environment for Scientific Software Installations (EESSI) project, which aims to provide a ready-to-use stack of scientific software installations that can be leveraged easily on a variety of platforms, ranging from personal workstations to cloud environments and supercomputer infrastructure, without compromising on performance. The authors provide a detailed overview of the project, highlight potential use cases, and demonstrate that the performance of the provided scientific software installations can be competitive with system-specific installations.

The tenth paper, titled “A large scale parallel fluid-structure interaction computing platform for simulating structural responses to a detonation shock” by Yang et al.¹⁰, presents a partitioned fluid-structure interaction computing platform designed for the parallel simulation of structural responses to a detonation shock. The 3D numerical result of structural responses to a detonation shock is presented and analyzed. On 256 processor cores with 5.1 million mesh cells, the detonation-shock simulation reaches a speedup of 178.0, corresponding to a parallel efficiency of 69.5%. The results demonstrate the good potential of massively parallel simulations. Overall, the authors propose a general-purpose fluid-structure interaction software platform with detonation support, built by integrating open source codes.

The authors express their sincere gratitude to the Editor-in-Chief, Dr. Rajkumar Buyya, for guiding them in organizing this special issue. The authors appreciate the support from the editorial office.
The authors also thank all the authors who submitted their work to this special issue and the reviewers for their thoughtful and critical suggestions, which improved the quality of the submitted papers.
