Abstract: High-performance computing (HPC) offers the computing power to continuously support the world's most important discoveries in various scientific and business domains such as chemistry, physics, biology, material science, drug discovery, and financial investment risk analysis. We are now in the exascale era, with the Frontier exascale system unveiled in June 2022. Researchers from across the HPC community have been developing software systems, tools, libraries, frameworks, application packages, and methods that can fully exploit these extremely powerful computing resources. Such extreme-scale computing will enable the solution of vastly more accurate predictive models and the analysis of massive quantities of data, producing quantum advances in areas of science and technology that are essential to the scientific community.
New computational approaches such as machine learning/deep learning have also been heavily explored in recent years and have shown promise on many problems that traditional computational simulation and engineering cannot resolve. Training deep neural networks with massive data is an extremely compute-intensive task that relies heavily on HPC power. The exascale computing era will be the essential basis for supporting new innovations in machine learning/deep learning-based exploration and will lead to new sciences in directions such as smart manufacturing, laboratory automation, and automatic programming.
While the hardware architecture can generate extreme computing power, renovation in the software stack plays an essential role in effective performance delivery. Large supercomputers continue to move into the heterogeneous space, and systems such as Fugaku, the former fastest system and ARM-based, and the many-core Sunway TaihuLight are marching in the same direction.
With systems equipped with GPUs, the Advanced RISC Machine (ARM) Scalable Vector Extension (SVE), and many cores, there is a dire need for innovative software frameworks that can seamlessly migrate scientific code to these systems and their rich computing resources. We need innovation at different levels, including compiler tools and techniques, performance analysis tools, novel abstractions of the programming model, redesign of application-level algorithms, and so on. Furthermore, co-design of applications and low-level software frameworks can lead to more efficient use of the opportunities of exascale in many contexts. This special issue has selected 14 papers. We next summarize the papers presented in this special issue.
The first paper titled “ParTransgrid: A scalable parallel pre-processing tool for unstructured-grid cell-centered Computational Fluid Dynamics (CFD) applications” by Jianqiang et al.1 proposes a parallel pre-processing tool, called ParTransgrid, that translates general grid formats such as the CFD General Notation System into an efficient distributed mesh data format for large-scale parallel computing. Experimental results reveal that ParTransgrid scales easily to CFD applications with billion-level grids and that the preparation time for parallel computing with hundreds of thousands of cores is reduced to a few minutes.
The second paper titled “Generation of logic designs for efficiently solving ordinary differential equations on FPGAs” by Korch et al.2 proposes a framework that automatically generates specific and optimized solver logic from easy-to-handle configuration files. No manual development and no special Field Programmable Gate Array (FPGA) or programming knowledge are required. The logic generated by this improved approach is up to 43 times faster than its hand-optimized High Level Synthesis (HLS) counterpart, depending on the solution method.
The third paper titled “NAS Parallel Benchmarks with Compute Unified Device Architecture (CUDA) and Beyond” by Fernandes et al.3 provides a new CUDA implementation of the NAS Parallel Benchmarks (NPB). The performance results show improvements of up to 267% over the best benchmark versions available. The authors also identify the best and worst design choices with respect to the tradeoff between code size and performance. Lastly, the authors highlight the challenges of implementing parallel CFD applications for Graphics Processing Units (GPUs) and how the computations impact the GPU's behavior.
The fourth paper titled “Using Ginkgo's Memory Accessor for Improving the Accuracy of Memory-Bound Low Precision Basic Linear Algebra Subprograms (BLAS)” by Quintana-Ortí et al.4 demonstrates that memory-bound applications operating on low-precision data can increase their accuracy by relying on the memory accessor to perform all arithmetic operations in high precision. In particular, the authors demonstrate that memory-bound BLAS operations (including the sparse matrix-vector product) can be re-engineered with the memory accessor and that the resulting accessor-enabled BLAS routines achieve lower rounding errors while delivering the same performance as the fast low-precision BLAS.
The fifth paper titled “Three Practical Workflow Schedulers for Easy Maximum Parallelism” by Rogers5 presents a complete characterization of the minimum effective task granularity for efficient scheduler usage scenarios. A separate job scheduler is implemented for each of three distinct workflow patterns involved in the preparation, execution, and analysis of computational chemistry simulations. The schedulers offer distinct benefits, including simplicity of design, suitability for HPC centers, short startup time, and well-understood per-task overhead. All three new tools have been shown to scale to full utilization of Summit and have been made publicly available with tests and documentation.
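The accessor idea summarized for the fourth paper can be illustrated with a minimal sketch, assuming IEEE 754 half precision as the storage format and Python's double-precision floats as the arithmetic format. The function names `to_half`, `dot_low_precision`, and `dot_accessor_style` are hypothetical and chosen for illustration only; this is not Ginkgo's API, only a toy model of storing data in low precision while computing in high precision.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (the storage format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot_low_precision(xs, ys):
    """Dot product where every intermediate result is rounded back to half precision."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_half(acc + to_half(x * y))
    return acc


def dot_accessor_style(xs, ys):
    """Dot product on the same half-precision data, accumulated in double precision.

    Assumption: this mimics an accessor that converts values to high precision
    on load, so only the stored data (not the arithmetic) is low precision.
    """
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


if __name__ == "__main__":
    n = 4096
    xs = [to_half(0.1)] * n  # inputs already live in half precision
    ys = [1.0] * n
    reference = n * xs[0]
    print("low-precision error: ", abs(dot_low_precision(xs, ys) - reference))
    print("accessor-style error:", abs(dot_accessor_style(xs, ys) - reference))
```

Both variants read the same half-precision inputs, so the memory traffic in this toy model is identical; only the precision of the intermediate arithmetic differs, which is where the accuracy gain comes from.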
The sixth paper titled “LLAMA: The Low-Level Abstraction for Memory Access” by Bussmann et al.6 presents the Low-Level Abstraction of Memory Access (LLAMA), a C++ library that provides a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++-compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
The seventh paper titled “PAS: A new powerful and simple quantum computing simulator” by Wang et al.7 proposes a new powerful and simple CPU-based quantum computing simulator: PAS (Power And Simple). Compared with existing simulators, PAS introduces four novel optimization methods: efficient hybrid vectorization, fast bitwise operation, memory access filtering, and quantum tracking. Experiments performed on the Intel Xeon E5-2670 v3 CPU showed that, compared with the state-of-the-art simulator QuEST, PAS achieves mean speedups of 8.69x and 2.62x on the Quantum Fourier Transform (QFT) and Random Quantum Circuit (RQC) benchmarks, respectively.
The eighth paper titled “Dynamics Signature based Anomaly Detection” by Bader et al.8 borrows the dynamics metrics and proposes the concept of Dynamics Signature (DS) in multi-dimensional feature space to efficiently distinguish abnormal events from the normal behavior of a variable star. Two datasets are used to evaluate the practical performance of the proposed DS algorithm: a parameterized sinusoidal dataset containing 262,440 light curves and a real variable star-based dataset containing 462,996 light curves.
Experimental results show that their DS algorithm is highly accurate, sensitive enough to detect weak microlensing events at very early stages, and fast enough to process 176,000 stars in less than 1 second on a commodity computer.
The ninth paper titled “EESSI: A Cross-Platform Ready-To-Use Optimised Scientific Software Stack” by Röblitz et al.9 describes the European Environment for Scientific Software Installations (EESSI) project, which aims to provide a ready-to-use stack of scientific software installations that can be leveraged easily on a variety of platforms, ranging from personal workstations to cloud environments and supercomputer infrastructure, without making compromises with respect to performance. The authors provide a detailed overview of the project, highlight potential use cases, and demonstrate that the performance of the provided scientific software installations can be competitive with system-specific installations.
The tenth paper titled “A large scale parallel fluid-structure interaction computing platform for simulating structural responses to a detonation shock” by Yang et al.10 presents a partitioned fluid-structure interaction computing platform designed for the parallel simulation of structural responses to a detonation shock. The 3D numerical result of structural responses to a detonation shock is presented and analyzed. On 256 processor cores, the speedup of the detonation shock simulations reaches 178.0 with 5.1 million mesh cells, corresponding to a parallel efficiency of 69.5%. The results demonstrate the good potential of massively parallel simulations. Overall, a general-purpose fluid-structure interaction software platform with detonation support is proposed by integrating open source codes.
The authors express their sincere gratitude and thanks to the Editor-in-Chief, Dr. Rajkumar Buyya, for guiding them in organizing this special issue. The authors appreciate the support from the editorial office.
The authors are also thankful to all the authors who submitted their ideas to this special issue and to the reviewers for their thoughtful and critical suggestions to improve the quality of the submitted papers.
