An efficient sparse stiffness matrix vector multiplication using compressed sparse row storage format on AMD GPU

Longyue Xing, Zhaoshun Wang, Zhezhao Ding, Genshen Chu, Lingyu Dong, Nan Xiao
DOI: https://doi.org/10.1002/cpe.7186
2022-07-20
Concurrency and Computation: Practice and Experience
Abstract: The performance of sparse stiffness matrix‐vector multiplication is essential for large‐scale numerical simulation in structural mechanics. Compressed sparse row (CSR) is the most common format for storing sparse stiffness matrices. However, because stiffness matrices are highly sparse, each row contains very few nonzero elements. As a result, the CSR‐scalar, light, and HOLA algorithms leave some GPU threads idle during the calculation, which both degrades computing performance and wastes computing resources. This article proposes a new algorithm, CSR‐vector row, for fine‐grained computing optimization based on the AMD GPU architecture on heterogeneous supercomputers. The algorithm assigns a vector to calculate a row, based on the number of nonzero elements of the stiffness matrix. CSR‐vector row features efficient reduce operations, deep memory access optimization, and a kernel function configuration scheme that better overlaps memory access with calculation. The memory access bandwidth of the algorithm on an AMD GPU is more than 700 GB/s. Compared with the CSR‐scalar algorithm, CSR‐vector row improves parallel efficiency by 7.2 times, and its floating‐point computing performance is 41%–95% higher than that of the light and HOLA algorithms. In addition, when CSR‐vector row is applied to examples from CFD, electromagnetics, quantum chemistry, power networks, and semiconductor processes, its memory access bandwidth and double‐precision floating‐point performance also improve on those of rocSPARSE‐CSR‐vector.
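The contrast between one thread per row (CSR‐scalar) and one vector of lanes per row can be illustrated with a small CPU‐side emulation. This is a minimal sketch, not the authors' kernel: the function names, the rule for picking the vector size (smallest power of two covering the mean nonzeros per row, capped at an assumed 64 lanes), and the tree reduction are all illustrative assumptions.

```python
def csr_spmv_scalar(row_ptr, col_idx, vals, x):
    """Reference CSR SpMV: one sequential accumulation per row."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def pick_vector_size(row_ptr, max_lanes=64):
    """Hypothetical heuristic: smallest power of two >= mean nonzeros
    per row, capped at max_lanes (64 is an assumed lane limit)."""
    n_rows = len(row_ptr) - 1
    mean_nnz = (row_ptr[-1] - row_ptr[0]) / max(n_rows, 1)
    size = 1
    while size < mean_nnz and size < max_lanes:
        size *= 2
    return size


def csr_spmv_vector(row_ptr, col_idx, vals, x, vector_size):
    """Emulates vector_size lanes cooperating on each row.
    vector_size must be a power of two."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        partial = [0.0] * vector_size
        # Each lane handles nonzeros start+lane, start+lane+vector_size, ...
        for lane in range(vector_size):
            for k in range(start + lane, end, vector_size):
                partial[lane] += vals[k] * x[col_idx[k]]
        # Tree reduction of the per-lane partial sums into lane 0.
        stride = vector_size // 2
        while stride > 0:
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y.append(partial[0])
    return y


# Small tridiagonal example in CSR form (illustrative data only).
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]

vector_size = pick_vector_size(row_ptr)
y_vector = csr_spmv_vector(row_ptr, col_idx, vals, x, vector_size)
y_scalar = csr_spmv_scalar(row_ptr, col_idx, vals, x)
```

On a GPU the lanes of a vector would run concurrently and the reduction would use lane‐to‐lane communication rather than a Python list. Matching the vector size to the row length is what keeps lanes from sitting idle on short rows.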