Abstract: Stencil computation arises in a wide variety of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Because stencil kernels are memory-bound, they are challenging to optimize on many leadership supercomputers, such as Sunway TaihuLight, which offer relatively high computing throughput but relatively low data-movement capability. In this white paper, we present our efforts over the past two years to develop end-to-end implementation and optimization techniques for extreme-scale stencil computations on Sunway TaihuLight. We started by optimizing the 3-D 2nd-order 13-point stencil for nonhydrostatic atmospheric dynamics simulation, an important part of the work that won the 2016 ACM Gordon Bell Prize, and extended it to handle a broader range of realistic and challenging problems, such as the HPGMG benchmark, which consists of memory-hungry stencils, and gaseous wave detonation simulation, which relies on complex high-order stencils. The presented stencil computation paradigm on Sunway TaihuLight includes not only multilevel parallelization to exploit parallelism at different hardware levels, but also systematic performance optimization techniques for communication, memory access, and computation. Extreme-scale tests show that the proposed paradigm delivers remarkable performance on Sunway TaihuLight with ten million heterogeneous cores. In particular, we achieve an aggregate performance of 23.12 Pflops for the 3-D 5th-order WENO stencil computation in gaseous wave detonation simulation, which is, to the best of our knowledge, the highest performance reported for high-order stencil computations, and an aggregate performance of over one trillion unknowns solved per second in the HPGMG benchmark, which ranks first on the HPGMG list of November 2017.

Scaling Graph 500 SSSP to 140 Trillion Edges with over 40 Million Cores

Scaling Graph Traversal to 281 Trillion Edges with 40 Million Cores

An Edge-Fencing Strategy for Optimizing SSSP Computations on Large-Scale Graphs

ShenTu: Processing Multi-Trillion Edge Graphs on Millions of Cores in Seconds

Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores

TianheGraph: Customizing Graph Search for Graph500 on Tianhe Supercomputer

Enabling and Scaling the HPCG Benchmark on the Newest Generation Sunway Supercomputer with 42 Million Heterogeneous Cores

UniDegree: A GPU-Based Graph Representation for SSSP

Research on Performance Optimization for Large-Scale Sparse Computation over Many-Core Heterogenous Supercomputer

fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms

Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration

A Hierarchical Tridiagonal System Solver for Heterogenous Supercomputers

Swsptrsv

A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010

Highly Efficient Breadth-First Search on CPU-Based Single-Node System

K-Core Decomposition on Super Large Graphs with Limited Resources

NXgraph: an Efficient Graph Processing System on a Single Machine

Customizing Graph500 for Tianhe Pre-exacale system

Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph

Extreme-Scale Realistic Stencil Computations on Sunway TaihuLight with Ten Million Cores

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning