Abstract: Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, irregular memory access patterns, a large memory footprint, high bandwidth requirements, and underutilized parallelism limit the efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. We partition the original matrix into uniformly sized blocks, with the block size chosen according to architectural characteristics and accuracy requirements. We then dynamically assign a precision to each block using a precision selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (precision-based partitioning) and BDMP-TCKI (tailored compression and kernel implementation). BDMP-PBP splits the matrix, according to block precision, into two independent matrices that are computed separately, which makes it easy to combine with other optimization techniques. BDMP-TCKI instead aims for high thread-level parallelism and memory utilization by tailoring a compressed storage format and kernel implementation to each block. We compare BDMP with NVIDIA's cuSPARSE library and three state-of-the-art SpMV methods, namely SELLP, MergeBase, and BalanceCSR, using matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection). BDMP-PBP and BDMP-TCKI show average speedups of up to 2.64× and 2.91× on the Turing RTX 2080Ti, and up to 2.99× and 3.22× on the Ampere A100. The results demonstrate that BDMP improves computation speed without compromising the precision necessary for reliable results.
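The block partitioning, per-block precision assignment, and precision-based split described in the abstract can be illustrated with a small CPU-side sketch. This is not the paper's implementation: the abstract does not state the actual precision selection rule, the block size, or the precisions used, so the sketch assumes two precisions (IEEE half and Python's native double), a hypothetical rule that a block is stored in half precision only if every value survives rounding within a relative tolerance, and a dictionary-based coordinate format. All function names (`round_to_half`, `partition_blocks`, `split_by_precision`, `bdmp_pbp_spmv`) are illustrative.

```python
import struct
from collections import defaultdict


def round_to_half(x):
    """Round a float to IEEE half precision and return it as a Python float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude exceeds the half-precision range.
        return float("inf")


def partition_blocks(entries, block_size):
    """Group COO entries {(row, col): value} into uniform square blocks."""
    blocks = defaultdict(dict)
    for (i, j), v in entries.items():
        blocks[(i // block_size, j // block_size)][(i, j)] = v
    return blocks


def block_fits_half(block, rel_tol):
    """Assumed selection rule: every value must round to half within rel_tol."""
    return all(abs(round_to_half(v) - v) <= rel_tol * abs(v)
               for v in block.values())


def split_by_precision(entries, block_size, rel_tol=1e-3):
    """Split the matrix into a low-precision part and a high-precision part."""
    low, high = {}, {}
    for block in partition_blocks(entries, block_size).values():
        if block_fits_half(block, rel_tol):
            # Store the values as they would appear after half-precision rounding.
            low.update({k: round_to_half(v) for k, v in block.items()})
        else:
            high.update(block)
    return low, high


def spmv(entries, x, n_rows):
    """Reference SpMV y = A x on COO entries."""
    y = [0.0] * n_rows
    for (i, j), v in entries.items():
        y[i] += v * x[j]
    return y


def bdmp_pbp_spmv(entries, x, n_rows, block_size=2, rel_tol=1e-3):
    """Compute the two precision-specific products separately, then add them."""
    low, high = split_by_precision(entries, block_size, rel_tol)
    y_low = spmv(low, x, n_rows)
    y_high = spmv(high, x, n_rows)
    return [a + b for a, b in zip(y_low, y_high)]


# A 4x4 example with 2x2 blocks. The block holding 1e-9 underflows in half
# precision, so under the assumed rule it stays in high precision.
A = {(0, 0): 1.0, (0, 1): 0.5, (1, 1): 2.0,
     (0, 3): 3.0, (2, 2): 0.1234567, (3, 3): 1e-9}
low_part, high_part = split_by_precision(A, block_size=2)
y = bdmp_pbp_spmv(A, [1.0, 1.0, 1.0, 1.0], n_rows=4)
```

Because the tolerance used here is looser than the rounding error of normal half-precision values, the assumed rule mostly reacts to the range of values in a block (overflow and underflow), which is one simple way a block's value distribution could drive the choice. On a GPU the two parts would be stored in separate compressed formats and multiplied by separate kernels; the sketch only shows the data flow.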

An Effective SPMV Based on Block Strategy and Hybrid Compression on GPU

Efficient Algorithm Design of Optimizing SpMV on GPU

Optimizing sparse matrix-vector multiplication based on GPU

Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Implementation and optimization of SpMV algorithm based on SW26010P many-core processor and stored in BCSR format

Parallel optimization for sparse matrix-vector on GPU

LSRB-CSR: A Low Overhead Storage Format for SpMV on the GPU Systems

An Optimized GP-GPU Warp Scheduling Algorithm for Sparse Matrix-Vector Multiplication

TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU

Improvement of Sparse Matrix-Vector Multiplication on GPU

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

FPGA and GPU Implementation of Large Scale SpMV

An Integral-equation-oriented Vectorized SpMV Algorithm and Its Application on CT Imaging Reconstruction

FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs

Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions

AMF-CSR: Adaptive Multi-Row Folding of CSR for SpMV on GPU

A Novel Fully Hardware-Implemented SVD Solver Based on Ultra-Parallel BCV Jacobi Algorithm

TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs

A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems

TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs