Abstract: Stencil computation arises in a wide variety of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Because stencil kernels are memory-bound, optimizing them is challenging on many leadership supercomputers, such as Sunway TaihuLight, which has relatively high computing throughput but relatively low data-movement capability. In this white paper, we present our efforts over the past two years in developing end-to-end implementation and optimization techniques for extreme-scale stencil computations on Sunway TaihuLight. We started by optimizing the 3-D 2nd-order 13-point stencil for nonhydrostatic atmospheric dynamics simulation, which is an important part of the 2016 ACM Gordon Bell Prize-winning work, and extended it to handle a broader range of realistic and challenging problems, such as the HPGMG benchmark, which consists of memory-hungry stencils, and the gaseous wave detonation simulation, which relies on complex high-order stencils. The presented stencil computation paradigm on Sunway TaihuLight includes not only multilevel parallelization to exploit the parallelism at different hardware levels, but also systematic performance optimization techniques for communication, memory access, and computation. We show by extreme-scale tests that the proposed paradigm delivers remarkable performance on Sunway TaihuLight with ten million heterogeneous cores. In particular, we achieve an aggregate performance of 23.12 Pflops for the 3-D 5th-order WENO stencil computation in gaseous wave detonation simulation, which is, as far as we know, the highest performance result for high-order stencil computations, and an aggregate performance of over one trillion unknowns solved per second in the HPGMG benchmark, which ranks first on the HPGMG list of November 2017.

Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax-Wendroff correction stencil

Accelerating the 3D Elastic Wave Forward Modeling on GPU and MIC

Scaling and analyzing the stencil performance on multi-core and many-core architectures

Performance Modeling of Stencil Computation on SW26010 Processors

Parallelized Implementation of the Finite Particle Method for Explicit Dynamics in GPU

Cache-Friendly Design for Complex Spatially-Variable Coefficient Stencils on Many-Core Architectures

Extreme-Scale Realistic Stencil Computations on Sunway TaihuLight with Ten Million Cores

26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight

Optimizing Complex Spatially-Variant Coefficient Stencils for Seismic Modeling on GPU

HW/SW Co-Optimization for Stencil Computation: Beginning with a Customizable Core

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

GPU Support for Automatic Generation of Finite-Differences Stencil Kernels

Multicore-optimized wavefront diamond blocking for optimizing stencil updates

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods

Balancing CPU and GPU: Real-Time Visualization of Large Scale 3D Scanning Models

Optimizing Three-Dimensional Stencil-Operations on Heterogeneous Computing Environments

A Low Overhead Heterogeneous Parallel Optimization Method Based on 3-D Elastic Wave Numerical Simulation

A Low Overhead Heterogeneous Parallel Optimization Method Based on Three-Dimensional Elastic Wave Numerical Simulation

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Extreme-Scale High-Order WENO Simulations of 3-D Detonation Wave with 10 Million Cores

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures