A Hierarchical Grid Algorithm for Accelerating High-Performance Conjugate Gradient Benchmark on Sunway Many-Core Processor

Chenzhi Liao, Junshi Chen, Wenting Han, Huanqi Cao, Zhichao Su, Wanwang Yin, Hong An
DOI: https://doi.org/10.1145/3162957.3163049
2017-01-01
Abstract: This paper presents an analysis and optimizations of the High Performance Conjugate Gradient (HPCG) benchmark on the Sunway many-core processor. On modern multi-core and many-core processors, HPCG typically performs poorly and under-utilizes computational resources because of its low arithmetic intensity and fine-grained parallelism. We apply two conventional methods, Level-Scheduling (LS) and Multi-Coloring (MC), to parallelize the Gauss-Seidel smoother, the most time-consuming kernel in HPCG. These strategies are effective, achieving 1.54x and 5.52x performance improvements. To overcome the poor locality of MC and the limited parallelism of LS, we propose a novel Hierarchical Grid (HG) algorithm. Our algorithmic and architecture-aware optimizations achieve an aggregate performance of 3.54 Gflops, which is around 0.475% of peak performance and 15.4x higher than the reference implementation on a single core group (CG) of the SW26010 processor. With MPI parallelization, we balance parallelism, pre-processing, convergence rate, and communication overheads, and achieve 192 TFlops (70% parallelization efficiency) when scaling to 81920 CGs (5,324,800 cores) on the Sunway TaihuLight system. Moreover, we analyze the adaptability of our parallel method and optimization strategies and summarize several key points for refactoring and optimizing HPC applications on the Sunway heterogeneous many-core processor.
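To make the trade-off between the two conventional methods concrete, the following is a minimal sketch, not the paper's implementation. It assumes a small 2D grid with a 5-point stencil as a stand-in for HPCG's 3D 27-point stencil, and the helper names (`grid_neighbors`, `level_schedule`, `greedy_coloring`) are hypothetical. Level scheduling groups rows into wavefronts that respect the sequential Gauss-Seidel dependencies, while multi-coloring reorders rows so that rows of the same color share no coupling.

```python
from collections import defaultdict


def grid_neighbors(nx, ny):
    """Adjacency of a 2D 5-point stencil grid, rows numbered lexicographically.

    Assumption: a 2D 5-point stencil is used only to keep the example small.
    """
    adj = defaultdict(list)
    for y in range(ny):
        for x in range(nx):
            i = y * nx + x
            for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                px, py = x + dx, y + dy
                if 0 <= px < nx and 0 <= py < ny:
                    adj[i].append(py * nx + px)
    return adj


def level_schedule(adj, n):
    """Level of row i = 1 + max level of the earlier rows it depends on.

    Rows in the same level have no mutual dependencies in a forward
    Gauss-Seidel sweep, so they can be updated in parallel.
    """
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in adj[i] if j < i]
        level[i] = 1 + max(deps, default=-1)
    return level


def greedy_coloring(adj, n):
    """Give each row the smallest color not used by an already colored neighbor.

    Rows of one color are not coupled, so each color is one parallel step.
    """
    color = [-1] * n
    for i in range(n):
        used = {color[j] for j in adj[i] if color[j] >= 0}
        c = 0
        while c in used:
            c += 1
        color[i] = c
    return color


nx, ny = 4, 4
n = nx * ny
adj = grid_neighbors(nx, ny)
levels = level_schedule(adj, n)
colors = greedy_coloring(adj, n)
num_levels = max(levels) + 1
num_colors = max(colors) + 1
print("levels:", num_levels, "colors:", num_colors)
```

On this 4x4 toy grid, level scheduling yields 7 wavefronts whose widths range from 1 to 4 rows, whereas coloring yields 2 colors of 8 rows each. That illustrates why LS offers limited parallelism, while MC exposes more parallelism at the cost of visiting rows that are far apart in memory.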