Efficient dense matrix-vector multiplication on GPU.

Guixia He, Jiaquan Gao, Jun Wang
DOI: https://doi.org/10.1002/cpe.4705
2018-01-01
Abstract: Dense matrix-vector multiplication (Ax or A^T x) is of great importance in scientific computing, and this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based implementation of Ax on the GPU, called GEMV-Adaptive, and a thread-based implementation of A^T x on the GPU, called GEMV-T-Adaptive. The proposed kernels offer the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive that assigns the optimal number of warps to each matrix row, (2) an adaptive thread allocation strategy for GEMV-T-Adaptive that assigns the optimal number of threads to each matrix row, and (3) several optimization schemes. Experimental results show that, for all test matrices, GEMV-Adaptive and GEMV-T-Adaptive mitigate the performance fluctuations of the corresponding CUBLAS library implementations, deliver consistently high performance, and outperform the most recently proposed GEMV and GEMV-T kernels by Gao et al., respectively.
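The idea of a warp-based Ax kernel can be illustrated with a small CPU emulation. This is only a minimal sketch, not the paper's GEMV-Adaptive kernel: the function names `warp_row_dot` and `gemv`, and the `warps_per_row` parameter, are hypothetical. The abstract does not say how the adaptive strategy chooses the warp count, so the sketch takes it as an input and assumes it is a power of two.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_row_dot(row, x, warps_per_row=1):
    """Emulate `warps_per_row` warps cooperating on one row's dot product.

    Hypothetical helper for illustration; `warps_per_row` must be a
    power of two so the tree reduction below halves cleanly.
    """
    if warps_per_row < 1 or warps_per_row & (warps_per_row - 1):
        raise ValueError("warps_per_row must be a power of two")
    lanes = WARP_SIZE * warps_per_row

    # Each lane accumulates a strided partial sum, so neighboring lanes
    # read neighboring elements (the coalesced pattern on a GPU).
    partial = [0.0] * lanes
    for lane in range(lanes):
        for j in range(lane, len(row), lanes):
            partial[lane] += row[j] * x[j]

    # Tree reduction: halve the number of active lanes at each step,
    # similar in spirit to a shuffle- or shared-memory-based reduction.
    stride = lanes // 2
    while stride > 0:
        for lane in range(stride):
            partial[lane] += partial[lane + stride]
        stride //= 2
    return partial[0]


def gemv(A, x, warps_per_row=1):
    """Compute y = A x, one emulated warp group per matrix row."""
    return [warp_row_dot(row, x, warps_per_row) for row in A]
```

For example, `gemv([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [1.0, 1.0, 1.0])` returns `[6.0, 15.0]`. In this emulation a short row leaves most of the 32 lanes idle, which hints at why the number of warps assigned to a row matters for performance.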