PETSc's Heterogeneous Parallel Algorithm Design and Performance Optimization on the Sunway TaihuLight System
Wen-Jie HONG, Ken-Li LI, Zhe QUAN, Wang-Dong YANG, Ke-Qin LI, Zi-Yu HAO, Xiang-Hui XIE
DOI: https://doi.org/10.11897/SP.J.1016.2017.02057
2017-01-01
Abstract: Large-scale scientific and engineering computations, such as hydrodynamic calculations, numerical weather forecasting, seismic data processing, genetic engineering, and the solution of high-dimensional differential equations, face major performance challenges. Meanwhile, High Performance Computing (HPC) platforms have developed significantly in recent years. The emergence of multi-core processors and heterogeneous computing platforms has dramatically improved the performance of high-performance applications. To fully utilize the computing power of HPC systems, it is necessary to develop specific methodologies that optimize application performance for the system architecture. The Sunway TaihuLight supercomputer is presently ranked in the TOP500 list as the fastest supercomputer in the world, with a LINPACK benchmark rating of 93 petaflops. It uses a total of 40960 Chinese-designed SW26010 multi-core 64-bit RISC processors. The Portable, Extensible Toolkit for Scientific Computation (PETSc), an indispensable module of high performance computing, is one of the basic algorithm libraries widely applied in many high-performance applications. PETSc is also widely used for partial differential equations, sparse linear algebra, and related problems. The performance of PETSc directly affects the efficiency of the applications that invoke it. In this paper, we run two of the most typical PETSc examples, chosen according to actual research needs, on the Sunway TaihuLight supercomputer: ex5 (solving linear systems on a single node) and ex19 (solving the 2D driven cavity problem on multiple nodes). By analyzing the experimental results, we identify seven core functions, covering vector calculations and matrix calculations. First, for each core function, we study in depth its characteristics, the difficulties of parallelizing it, and optimizations for its bottlenecks. Then, we determine an appropriate heterogeneous parallel model for these functions on the SW26010 processor (there are a total of four heterogeneous parallel models on the Sunway TaihuLight). Finally, we work out the best task division strategy, determine the size of the data transferred, and design the parallel algorithm on the Sunway TaihuLight supercomputer. Furthermore, a series of novel performance optimization strategies is proposed according to the heterogeneous architecture of the Sunway TaihuLight system. These optimization methods mainly include access optimization, elimination of data dependencies, and vectorization. The experimental results in this paper show that our parallel algorithms for the seven core functions achieve a maximum speedup of 16.4 on a single node (containing 4 MPEs and 256 CPEs). When running on multiple nodes, the speedup reaches 32 on 8192 nodes relative to 256 nodes, when the input data scale is up to 16384. In addition, the speedup shows a linear trend as the number of processors increases. This paper demonstrates that our parallel algorithms for PETSc have good scalability, reliability and security on the Sunway TaihuLight supercomputer, which provides a reference for similar research.
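The multi-node figure quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the function names `ideal_relative_speedup` and `relative_efficiency` are hypothetical helpers, not part of PETSc or of the paper, and the only inputs are the numbers stated in the abstract (a speedup of 32 on 8192 nodes relative to a 256-node baseline).

```python
def ideal_relative_speedup(base_nodes: int, scaled_nodes: int) -> float:
    """Speedup expected under perfectly linear scaling from base_nodes to scaled_nodes."""
    return scaled_nodes / base_nodes


def relative_efficiency(measured_speedup: float, base_nodes: int, scaled_nodes: int) -> float:
    """Fraction of the ideal linear speedup that was actually achieved."""
    return measured_speedup / ideal_relative_speedup(base_nodes, scaled_nodes)


# Figures taken from the abstract; everything else here is an assumption for illustration.
BASE_NODES = 256
SCALED_NODES = 8192
MEASURED_SPEEDUP = 32.0

ideal = ideal_relative_speedup(BASE_NODES, SCALED_NODES)
efficiency = relative_efficiency(MEASURED_SPEEDUP, BASE_NODES, SCALED_NODES)

print(f"ideal speedup: {ideal}, relative efficiency: {efficiency}")
```

Under these assumptions the reported speedup equals the node-count ratio, which is what a linear trend relative to the 256-node baseline implies. The calculation says nothing about efficiency relative to a single node, since the abstract does not state that baseline for the multi-node runs.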