623 Tflop/s HPCG Run on Tianhe-2: Leveraging Millions of Hybrid Cores.

Yiqun Liu, Chao Yang, Fangfang Liu, Xianyi Zhang, Yutong Lu, Yunfei Du, Canqun Yang, Min Xie, Xiangke Liao
DOI: https://doi.org/10.1177/1094342015616266
2015-01-01
The International Journal of High Performance Computing Applications
Abstract: In this article, we present a new hybrid algorithm to enable and scale the high-performance conjugate gradients (HPCG) benchmark on large-scale heterogeneous systems such as Tianhe-2. Based on an inner-outer subdomain partitioning strategy, the data distribution between host and device can be balanced adaptively. The overhead of data movement from both MPI communication and PCI-E transfer can be significantly reduced by carefully rearranging and fusing operations. A variety of parallelization and optimization techniques for performance-critical kernels are exploited and analyzed to maximize the performance gain on both host and device. We carry out experiments on both a small heterogeneous computer and the world's largest one, Tianhe-2. On the small system, we present a thorough comparison and analysis to select among different optimization choices. On Tianhe-2, the optimized implementation scales to the full-system level of 3.12 million heterogeneous cores, with an aggregate performance of 623 Tflop/s and a parallel efficiency of 81.2%.