xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor
Fangfang Liu, Wenjing Ma, Yuwen Zhao, Daokun Chen, Yi Hu, Qinglin Lu, WanWang Yin, Xinhui Yuan, Lijuan Jiang, Hao Yan, Min Li, Hongsen Wang, Xinyu Wang, Chao Yang
DOI: https://doi.org/10.1007/s42514-022-00126-8
2022-10-19
CCF Transactions on High Performance Computing
Abstract: High-performance extended math libraries are used by many scientific, engineering, and artificial intelligence applications; they collect the common mathematical computations that are usually the most time-consuming functions in those applications. To take full advantage of high-performance processors, these functions need to be parallelized and intensively optimized. Processor vendors commonly supply highly optimized commercial math libraries: for example, Intel maintains oneMKL, and NVIDIA provides cuBLAS, cuSolver, and cuFFT. In this paper, we release a new-generation high-performance extended math library, xMath 2.0, specifically designed for the SW26010-Pro many-core processor. It includes four major modules: BLAS, LAPACK, FFT, and SPARSE. Each module is optimized for the domestic SW26010-Pro processor, leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping. In xMath 2.0, the BLAS module achieves an average speedup of 146.02 times over the MPE version of GotoBLAS2, and the BLAS level 3 functions achieve a speedup of 393.95 times. The LAPACK module (calling xMath BLAS) is 233.44 times faster than LAPACK (calling GotoBLAS2), and the FFT module is 47.63 times faster than FFTW3.3.2. The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer, where it has been used by dozens of users.
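For readers unfamiliar with the BLAS level 3 functions mentioned in the abstract, the following is a minimal reference sketch of the general matrix-matrix multiply (GEMM) operation, C ← αAB + βC, which is the core level 3 routine. It is written in plain Python for illustration only: the function name `gemm_reference` and its argument order are hypothetical and are not the xMath 2.0 API, and the code uses none of the CPE-mesh parallelization or assembly-level optimizations described in the paper.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    Hypothetical helper for illustration; not part of xMath 2.0.
    a is m x k, b is k x n, c is m x n.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("incompatible matrix shapes")

    # Start from the scaled copy of C, so the input c is left unmodified.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]

    # Accumulate alpha * A * B with an i-p-j loop order, which walks
    # the rows of b and out contiguously.
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
result = gemm_reference(2, a, b, 3, c)
```

GEMM performs about 2mnk floating-point operations on only mk + kn + mn matrix elements, so level 3 routines offer far more data reuse than level 1 or level 2 routines. This is a general property of BLAS, consistent with level 3 functions showing the largest speedup reported above.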