Abstract: Sparse matrix-vector multiplication (SpMV) is of great importance in scientific computing. Graphics processing unit (GPU)-accelerated SpMV for large problems has attracted considerable attention recently. We observe that on a specific multi-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization framework, which consists of the following parts: (1) a simple rule that divides any given matrix among multiple GPUs; (2) a performance model, independent of the problem and dependent on the resources of the devices, that accurately predicts the execution time of SpMV kernels; and (3) a selection algorithm that, on the basis of the performance model, automatically selects the most appropriate of the storage formats included in the framework for the matrix block assigned to each GPU. The objective of our framework is not to construct a new storage format or algorithm but to automatically and rapidly generate an optimal parallel SpMV for any sparse matrix on a specific multi-GPU platform by integrating existing storage formats and their corresponding kernels. We take five popular storage formats as examples to present the idea of constructing the framework. Theoretically, we validate the correctness of our proposed SpMV performance model. This model is constructed only once for each type of GPU. Moreover, the framework is general and easily extensible: for a storage format that is not included in our framework, once the performance model of its corresponding SpMV kernel is successfully constructed, the format can be incorporated into our framework. The experiments validate the efficiency of our proposed framework.
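The three parts listed in the abstract (partition rule, performance model, format selection) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the partition rule (contiguous row blocks with roughly equal nonzero counts), the three format names used as stand-ins (CSR, ELL, COO), and the cost functions with their made-up coefficients. A real performance model would be fitted once per GPU type, as the abstract describes.

```python
from typing import Callable, Dict, List, Tuple

# (rows, nonzeros, longest row) of one matrix block
Stats = Tuple[int, int, int]


def block_stats(row_nnz: List[int]) -> Stats:
    """Summarize a block from its per-row nonzero counts."""
    return (len(row_nnz), sum(row_nnz), max(row_nnz, default=0))


# Hypothetical cost models: predicted kernel time in seconds.
# The coefficients are invented; they only make the example concrete.
MODELS: Dict[str, Callable[[Stats], float]] = {
    "CSR": lambda s: 2.0e-9 * s[1] + 1.0e-8 * s[0],
    "ELL": lambda s: 1.0e-9 * s[0] * s[2],  # pays for padding to the longest row
    "COO": lambda s: 3.0e-9 * s[1],
}


def partition_rows(row_nnz: List[int], num_gpus: int) -> List[Tuple[int, int]]:
    """Assumed rule: contiguous row blocks with roughly equal nonzeros."""
    total = sum(row_nnz)
    bounds = [0]
    running = 0
    for i, n in enumerate(row_nnz):
        running += n
        # Close the current block once its share of nonzeros is reached.
        if len(bounds) < num_gpus and running >= total * len(bounds) / num_gpus:
            bounds.append(i + 1)
    # Pad with empty blocks if the matrix has too few rows.
    while len(bounds) < num_gpus:
        bounds.append(len(row_nnz))
    bounds.append(len(row_nnz))
    return [(bounds[k], bounds[k + 1]) for k in range(num_gpus)]


def select_formats(row_nnz: List[int], num_gpus: int) -> List[Tuple[int, int, str]]:
    """For each GPU's block, pick the format with the lowest predicted time."""
    plan = []
    for start, end in partition_rows(row_nnz, num_gpus):
        stats = block_stats(row_nnz[start:end])
        best = min(MODELS, key=lambda fmt: MODELS[fmt](stats))
        plan.append((start, end, best))
    return plan


if __name__ == "__main__":
    # Four uniform rows followed by four skewed rows, split across two GPUs.
    print(select_formats([4, 4, 4, 4, 1, 1, 1, 13], 2))
```

Under these toy models, a block with uniform row lengths tends to favor the padded format, while a block with one very long row favors a format whose cost depends only on the nonzero count. Adding a further format to the sketch only requires adding its cost function to `MODELS`, which mirrors the extensibility claim in the abstract.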

A Cross-Platform SpMV Framework on Many-Core Architectures.

Tpspmv: A Two-Phase Large-Scale Sparse Matrix-Vector Multiplication Kernel for Manycore Architectures

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Towards Efficient SpMV on Sunway Manycore Architectures.

Efficient Algorithm Design of Optimizing SpMV on GPU.

Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture

Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs

Automatic Tuning of Sparse Matrix-Vector Multiplication on Multicore Clusters.

Towards Large-Scale Sparse Matrix-Vector Multiplication on the SW26010 Manycore Architecture.

Exploring Better Speculation and Data Locality in Sparse Matrix-Vector Multiplication on Intel Xeon

A Simple and Efficient Storage Format for SIMD-accelerated SpMV

An Integral-equation-oriented Vectorized SpMV Algorithm and Its Application on CT Imaging Reconstruction

ALBUS: A Method for Efficiently Processing SpMV Using SIMD and Load Balancing

Yet Another Hybrid Strategy for Auto-tuning SpMV on GPUs

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

Efficiently Running SpMV on Multi-core DSPs for Banded Matrix

A novel multi-graphics processing unit parallel optimization framework for the sparse matrix-vector multiplication.

Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs

Multi-GPU Implementation and Performance Optimization for CSR-Based Sparse Matrix-Vector Multiplication

A Comprehensive Performance Model of Sparse Matrix-Vector Multiplication to Guide Kernel Optimization

CSR2: A New Format for SIMD-accelerated SpMV.