Abstract: Sparse general matrix–matrix multiplication (SpGEMM) is a crucial and complex computational task in many practical applications. Improving the performance of SpGEMM on SIMT processors such as modern GPUs is challenging because the sparsity patterns of the input matrices are unpredictable. Although existing GPU solutions have improved performance through advanced algorithm design, they overlook some architecture-specific optimizations, which can leave parts of their implementations inefficient. This paper focuses on optimizing four inefficient parts of the NSparse algorithm on the DCU (a GPU-like accelerator). The optimizations are: 1) setting parameters to improve the load balance of the second matrix by extracting maximum row information at runtime; 2) reducing the overhead of binning operations by making full use of registers and shared memory; 3) improving numerical SpGEMM performance by adjusting its calculation mode; and 4) enhancing global load balance through finer-grained grouping and kernel configurations. Experimental results on 29 sparse matrices with different sparsity structures show that, compared with five state-of-the-art SpGEMM algorithms (bhSparse, KokkosKernels, NSparse, rocSparse, and spECK), our optimized method achieves average speedups of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x), respectively.
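To make the terms in the abstract concrete, the following is a minimal CPU-side sketch in Python of two ideas it mentions: row-wise SpGEMM on CSR inputs, and binning rows by an estimate of their work so that similar rows can be handled together. It is not the NSparse code or the paper's DCU implementation. The function names (`row_upper_bounds`, `bin_rows`, `spgemm_csr`) and the bin thresholds are hypothetical and chosen only for illustration, and the hash-map accumulator stands in for whatever accumulator a GPU kernel would use.

```python
from collections import defaultdict


def row_upper_bounds(a_indptr, a_indices, b_indptr):
    """Upper bound on the number of intermediate products per row of C = A * B.

    For row i of A, every nonzero A[i, k] contributes nnz(B[k, :]) products.
    """
    bounds = []
    for i in range(len(a_indptr) - 1):
        total = 0
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k = a_indices[p]
            total += b_indptr[k + 1] - b_indptr[k]
        bounds.append(total)
    return bounds


def bin_rows(bounds, thresholds=(2, 8, 32)):
    """Group row indices into bins by estimated work.

    The thresholds are illustrative assumptions, not values from the paper.
    Bin b holds rows whose bound exceeds exactly b of the thresholds.
    """
    bins = defaultdict(list)
    for row, n in enumerate(bounds):
        b = sum(n > t for t in thresholds)
        bins[b].append(row)
    return dict(bins)


def spgemm_csr(a, b):
    """Row-wise (Gustavson-style) SpGEMM on CSR triples (indptr, indices, data)."""
    a_indptr, a_indices, a_data = a
    b_indptr, b_indices, b_data = b
    c_indptr, c_indices, c_data = [0], [], []
    for i in range(len(a_indptr) - 1):
        acc = {}  # column index -> accumulated value for row i of C
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k, v = a_indices[p], a_data[p]
            for q in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[q]
                acc[j] = acc.get(j, 0.0) + v * b_data[q]
        for j in sorted(acc):  # emit columns in ascending order
            c_indices.append(j)
            c_data.append(acc[j])
        c_indptr.append(len(c_indices))
    return c_indptr, c_indices, c_data


# A = [[1, 0, 2], [0, 0, 0], [0, 3, 0]],  B = [[0, 1, 0], [4, 0, 0], [5, 0, 6]]
A = ([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
B = ([0, 1, 2, 4], [1, 0, 0, 2], [1.0, 4.0, 5.0, 6.0])

bounds = row_upper_bounds(A[0], A[1], B[0])
groups = bin_rows(bounds)
C = spgemm_csr(A, B)
```

In this sketch the per-row bounds play the role of the runtime row information that drives grouping: rows in the same bin have similar estimated work, which is the property a GPU implementation would use to pick a kernel configuration per group.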

Optimizations on Sparse Matrix-Vector Multiplication Based on CUDA

Optimizing sparse matrix-vector multiplication based on gpu

Parallel optimization for sparse matrix-vector on GPU

Improvement of Sparse Matrix-Vector Multiplication on GPU

Performance Modeling and Optimization of Sparse Matrix-Vector Multiplication on NVIDIA CUDA Platform

Optimizing Algorithm of Sparse Linear Systems on GPU

Sparse Matrix-Vector Multiplication Optimizations based on Matrix Bandwidth Reduction using NVIDIA CUDA

Design and Implementation of Matrix Multiplication on GPU

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

CUDA-based PCG algorithm optimization for a large sparse matrix

Efficient Algorithm Design of Optimizing SpMV on GPU

High Performance Matrix Multiplication on General Purpose Graphics Processing Units

Multicore-Based Performance Optimization For Evaluating The Inverse Of Sparse Matrix

Optimizing sparse general matrix–matrix multiplication for DCUs

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Multicore-Based Performance Optimization For Dense Matrix Computation

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

An Optimized GP-GPU Warp Scheduling Algorithm for Sparse Matrix-Vector Multiplication

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Improving Performance of Matrix Multiplication and FFT on GPU

Performance Optimization for Sparse A(T)Ax in Parallel on Multicore Cpu