Enabling and Scaling the HPCG Benchmark on the Newest Generation Sunway Supercomputer with 42 Million Heterogeneous Cores

Qianchao Zhu, Hao Luo, Chao Yang, Mingshuo Ding, Wanwang Yin, Xinhui Yuan
DOI: https://doi.org/10.1145/3458817.3476158
2021-01-01
Abstract: We study and evaluate performance optimization techniques for the HPCG benchmark on the newest generation Sunway supercomputer. Specifically, we propose a two-level blocking scheme that exposes adequate parallelism in the symmetric Gauss-Seidel kernel while keeping a fast convergence rate, develop a fine-grained kernel fusion technique that alleviates the bandwidth load on the small-capacity local storage, and present a low-overhead thread collaboration method that efficiently moves data between threads and hides its cost behind data transfer operations. Test results show that the optimized HPCG code exploits 73.0% of the theoretical memory bandwidth and scales to over 42 million heterogeneous cores with 95.5% weak-scaling efficiency and 5.91 Pflop/s performance. We also study how performance can be improved if the specific rules of HPCG are not fully obeyed, and design dependency-preserving parallelization and vectorization methods that further boost performance to 27.6 Pflop/s.
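To illustrate why blocking helps expose parallelism in a symmetric Gauss-Seidel sweep, the following is a minimal sketch of a generic, single-level block-coloured sweep on a 1D tridiagonal system. It is not the paper's two-level scheme, whose details the abstract does not give. The function name `symgs_block_colored`, its parameters, and the tridiagonal test problem are all assumptions made for illustration.

```python
def symgs_block_colored(diag, off, b, x, block_size):
    """One symmetric Gauss-Seidel sweep (forward, then backward) on a
    tridiagonal system with constant diagonal `diag` and constant
    off-diagonal `off`. Rows are grouped into contiguous blocks, and the
    blocks are coloured alternately (0, 1, 0, 1, ...). Illustrative only."""
    n = len(b)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def relax(i):
        # Standard Gauss-Seidel update for row i using the latest neighbours.
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = (b[i] - off * (left + right)) / diag

    # Forward sweep: all colour-0 blocks, then all colour-1 blocks.
    # Blocks of the same colour never touch each other's rows, so the
    # loop over k could run in parallel; rows inside a block stay sequential.
    for color in (0, 1):
        for k in range(color, len(blocks), 2):
            for i in blocks[k]:
                relax(i)

    # Backward sweep: reversed colour order and reversed row order,
    # which keeps the overall sweep symmetric.
    for color in (1, 0):
        for k in range(color, len(blocks), 2):
            for i in reversed(blocks[k]):
                relax(i)
    return x


if __name__ == "__main__":
    n = 8
    b = [1.0] + [0.0] * (n - 2) + [1.0]   # right-hand side whose exact solution is all ones
    x = [0.0] * n
    for _ in range(300):
        symgs_block_colored(2.0, -1.0, b, x, block_size=2)
    print(max(abs(v - 1.0) for v in x))
```

Larger blocks keep more of the sequential Gauss-Seidel ordering, which tends to help convergence, while more blocks per colour give more independent work. This is the trade-off between parallelism and convergence rate that the abstract refers to.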