Abstract: Sparse matrix-vector multiplication (SpMV) is an essential linear algebra operation that dominates the computing cost in many scientific applications. Because they provide massive parallelism and high memory bandwidth, GPUs are commonly used to accelerate SpMV kernels. Prior studies mainly focused on reducing the latency of SpMV kernels on GPUs. However, few attempts have been made to improve the energy efficiency of SpMV kernels, resulting in GPUs being excluded from the range of low-power applications. Furthermore, prior work has primarily focused on optimizing the sparse format of SpMV kernels and has not evaluated the impact of tweaking compilation parameters. Lastly, little attention has been paid to preparing a comprehensive training dataset of SpMV kernel runs and to fine-tuning the learning hyperparameters. To address these limitations, we present a novel framework, dubbed Auto-SpMV, that enables energy-efficient and low-latency SpMV kernels on GPUs. To achieve the best run-time performance, Auto-SpMV offers two optimization modes: compile-time and run-time. In the compile-time mode, Auto-SpMV tweaks the compilation parameters, while in the run-time mode, Auto-SpMV selects the best sparse format for the sparse input matrix. To achieve the best classification results, 1) we collect the largest dataset to date, comprising 30 different sparse matrices run with more than 15K different configurations, and 2) we boost classification models by automatically fine-tuning the learning hyperparameters. Experimental results reveal that, compared to the default setting, Auto-SpMV improves latency, energy consumption, average power, and energy efficiency in the compile-time mode by up to 51.9%, 52%, 33.2%, and 53%, respectively. In the run-time mode, Auto-SpMV improves average power and energy efficiency by up to 34.6% and 99.7%, respectively, compared to the default setting.
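To make concrete what "selecting a sparse format" means, the following minimal sketch computes the same SpMV product y = Ax with two storage formats. It is an illustration only, not the Auto-SpMV implementation: the choice of CSR and ELL as the two formats, all function names, and the padding-ratio feature are assumptions introduced here, and the code runs on the CPU in plain Python rather than as a GPU kernel.

```python
# Illustrative sketch (assumed names and formats; not the paper's code).

def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's nonzeros
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x for A stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def to_ell(dense):
    """Convert to ELL: every row is padded to the longest row's nonzero count."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    ell_vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    ell_cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return ell_vals, ell_cols


def spmv_ell(ell_vals, ell_cols, x):
    """Compute y = A x for A stored in ELL (padded entries contribute zero)."""
    return [
        sum(v * x[j] for v, j in zip(vals, cols))
        for vals, cols in zip(ell_vals, ell_cols)
    ]


def ell_padding_ratio(dense):
    """Fraction of ELL slots that are padding; a hypothetical selection feature."""
    ell_vals, _ = to_ell(dense)
    slots = sum(len(r) for r in ell_vals)
    nnz = sum(1 for row in dense for v in row if v != 0)
    return 0.0 if slots == 0 else (slots - nnz) / slots


# Small example: one long row forces heavy padding in ELL.
A = [
    [4, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 3, 0, 0],
    [5, 6, 7, 8],
]
x = [1, 2, 3, 4]
y_csr = spmv_csr(*to_csr(A), x)
y_ell = spmv_ell(*to_ell(A), x)
```

Both formats produce the same result vector, but ELL stores padded slots whenever row lengths are uneven, which is one reason the best format depends on the input matrix. A format selector of the kind the abstract describes would be trained on matrix features; the padding ratio above is only one plausible example of such a feature.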

FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs

Efficient Algorithm Design of Optimizing SpMV on GPU

TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs

TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs

Optimizing sparse matrix-vector multiplication based on gpu

AMF-CSR: Adaptive Multi-Row Folding of CSR for SpMV on GPU

Efficient sparse-matrix multi-vector product on GPUs

Auto-SpMV: Automated Optimizing SpMV Kernels on GPU

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU

Parallel optimization for sparse matrix-vector on GPU

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Atomic Reduction Based Sparse Matrix-Transpose Vector Multiplication on GPUs

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Improvement of Sparse Matrix-Vector Multiplication on GPU

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

LSRB-CSR: A Low Overhead Storage Format for SpMV on the GPU Systems

A Comprehensive Performance Model of Sparse Matrix-Vector Multiplication to Guide Kernel Optimization

Sparse Matrix-Vector Multiplication Optimizations based on Matrix Bandwidth Reduction using NVIDIA CUDA

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

Accelerating approximate matrix multiplication for near-sparse matrices on GPUs