Abstract: Sparse matrix-vector multiplication (SpMV) is of great importance in scientific computing. Graphics processing unit (GPU)-accelerated SpMV for large problems has attracted considerable attention recently. We observe that on a specific multi-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization framework, which comprises the following parts: (1) a simple rule for dividing any given matrix among multiple GPUs; (2) a performance model, independent of the problem and dependent on the resources of the devices, for accurately predicting the execution time of SpMV kernels; and (3) a selection algorithm that, on the basis of the performance model, automatically selects the most appropriate of the storage formats included in the framework for the matrix block assigned to each GPU. The objective of our framework is not to construct a new storage format or algorithm, but to automatically and rapidly generate an optimal parallel SpMV for any sparse matrix on a specific multi-GPU platform by integrating existing storage formats and their corresponding kernels. We take five popular storage formats as examples to present the idea behind constructing the framework. We validate the correctness of the proposed SpMV performance model theoretically; the model is constructed only once for each type of GPU. Moreover, the framework is general and easy to extend: a storage format that is not included in the framework can be incorporated once the performance model of its corresponding SpMV kernel has been successfully constructed. Experiments validate the efficiency of the proposed framework.
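
The three parts of the framework (partition rule, performance model, format selection) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the partition rule (contiguous row blocks with roughly equal nonzero counts), the format names (CSR, ELL, COO), and above all the cost formulas and their coefficients, which are toy stand-ins for a performance model that would in practice be fitted to measurements on each type of GPU. The helper names `partition_rows`, `PREDICTORS`, and `select_formats` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

RowCounts = List[int]  # number of nonzeros in each row


def partition_rows(row_nnz: RowCounts, num_gpus: int) -> List[Tuple[int, int]]:
    """Assumed rule: contiguous row blocks with roughly equal nonzero counts.

    Returns at most num_gpus half-open (start, end) row ranges.
    """
    total = sum(row_nnz)
    blocks: List[Tuple[int, int]] = []
    start, seen = 0, 0
    for i, n in enumerate(row_nnz):
        seen += n
        # Close the current block once the running count reaches its share.
        if len(blocks) < num_gpus - 1 and seen >= total * (len(blocks) + 1) / num_gpus:
            blocks.append((start, i + 1))
            start = i + 1
    blocks.append((start, len(row_nnz)))  # remaining rows go to the last block
    return blocks


# Toy predictors (hypothetical coefficients): estimated kernel cost per format,
# computed only from the row lengths of a block.
PREDICTORS: Dict[str, Callable[[RowCounts], float]] = {
    # CSR touches every nonzero plus one row-pointer entry per row.
    "CSR": lambda rows: 1.0 * sum(rows) + 0.5 * len(rows),
    # ELL pads every row to the longest row but has regular memory access.
    "ELL": lambda rows: 0.6 * len(rows) * max(rows, default=0),
    # COO touches only nonzeros but pays extra for the reduction step.
    "COO": lambda rows: 1.5 * sum(rows),
}


def select_formats(
    row_nnz: RowCounts,
    num_gpus: int,
    predictors: Dict[str, Callable[[RowCounts], float]] = PREDICTORS,
) -> List[Tuple[Tuple[int, int], str]]:
    """Pick, for each block, the format with the lowest predicted cost."""
    plan = []
    for start, end in partition_rows(row_nnz, num_gpus):
        rows = row_nnz[start:end]
        best = min(predictors, key=lambda name: predictors[name](rows))
        plan.append(((start, end), best))
    return plan


if __name__ == "__main__":
    # Four regular rows followed by a skewed tail, split across two GPUs.
    for block, fmt in select_formats([4, 4, 4, 4, 1, 1, 1, 13], num_gpus=2):
        print(block, fmt)
```

Because the predictors live in a dictionary, supporting an additional format in this sketch amounts to adding one entry, which mirrors the extensibility described in the abstract: a new format can join once a model of its kernel is available.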

A novel multi-graphics processing unit parallel optimization framework for the sparse matrix-vector multiplication.
