Abstract: Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures, including graphics processing units and Xeon Phi coprocessors. We first propose a hardware-oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and to improve data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware-oblivious implementation. The experiments show that it achieves performance comparable to Intel MKL and to the state-of-the-art OpenCL-based ViennaCL library. Although the OpenCL implementation provides functional portability across heterogeneous systems, it fails to take advantage of low-level architectural features. To further improve performance, we propose a hardware-conscious implementation using the native parallel programming language. We use the Xeon Phi platform as a case study. In our hardware-conscious implementation, we ensure that the HCC format efficiently utilizes the vector processing units on Xeon Phi by employing low-level intrinsics, and we improve overall performance through locality-aware block mapping and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that, compared with the OpenCL-based hardware-oblivious implementation, the hardware-conscious implementation achieves a 2.2x speedup on average. Compared with MKL, the hardware-conscious implementation achieves a 3.1x speedup on Xeon Phi.
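
For readers unfamiliar with the two storage formats that HCC combines, the following is a minimal reference sketch of SpMV in plain CSR and plain COO. It is not the HCC layout, the 2-D jagged partitioning, or either of the kernels described in the abstract; the function names `spmv_csr` and `spmv_coo` and the small test matrix are assumptions made only for illustration.

```python
from typing import List, Sequence


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """y = A @ x with A in CSR: row i owns entries row_ptr[i]..row_ptr[i+1]-1."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Work per row equals that row's nonzero count, so a few very long
        # rows (typical of scale-free matrices) dominate the running time.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_coo(n_rows: int, rows: Sequence[int], cols: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """y = A @ x with A in COO: one (row, col, value) triple per nonzero."""
    y = [0.0] * n_rows
    # Work is uniform per nonzero, at the cost of storing a row index for
    # every entry and scattering updates into y.
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


# Hypothetical 3x3 example: A = [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
x = [1.0, 2.0, 3.0]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cols = [0, 2, 1, 0, 1, 2]
y_csr = spmv_csr([0, 2, 3, 6], cols, vals, x)
y_coo = spmv_coo(3, [0, 0, 1, 2, 2, 2], cols, vals, x)
```

In a row-parallel CSR kernel the per-row loop length varies with the row's nonzero count, whereas COO exposes one uniform unit of work per nonzero; this contrast is the usual motivation for mixing the two formats when row lengths are highly skewed.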

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures
