Abstract: Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures, including graphics processing units and Xeon Phi coprocessors. We first propose a hardware-oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and to improve data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware-oblivious implementation. Experiments show that it achieves performance comparable to Intel MKL and to the state-of-the-art OpenCL-based ViennaCL library. Although the OpenCL implementation provides functional portability across heterogeneous systems, it does not take advantage of low-level architectural features. To further improve performance, we propose a hardware-conscious implementation using the native parallel programming language. We use the Xeon Phi platform as a case study. In our hardware-conscious implementation, we ensure that the HCC format efficiently utilizes the vector processing units on Xeon Phi by employing low-level intrinsics, and we improve overall performance through locality-aware block mapping and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that, compared with the OpenCL-based hardware-oblivious implementation, the hardware-conscious implementation achieves a 2.2x speedup on average. Compared with MKL, the hardware-conscious implementation achieves a 3.1x speedup on Xeon Phi.
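
As a rough illustration of what combining COO and CSR storage can look like, the sketch below splits a small matrix by row length: rows with many nonzeros (typical of scale-free matrices) go to a COO part, the rest to a CSR part, and the two partial products are accumulated into the same output vector. The split rule, the `threshold` parameter, and all function names are assumptions made for this example. It is not the HCC layout described in the abstract and does not model 2-D jagged partitioning, vectorization, or tiling.

```python
def spmv_csr(row_ptr, col_idx, vals, x, y):
    """Accumulate y += A @ x for the part of A stored in CSR."""
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] += acc


def spmv_coo(rows, cols, vals, x, y):
    """Accumulate y += A @ x for the part of A stored in COO."""
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]


def split_by_row_length(dense, threshold):
    """Hypothetical split rule: rows with more than `threshold`
    nonzeros are stored in COO, all other rows in CSR."""
    coo_rows, coo_cols, coo_vals = [], [], []
    row_ptr, col_idx, csr_vals = [0], [], []
    for i, row in enumerate(dense):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        if len(nonzeros) > threshold:
            for j, v in nonzeros:
                coo_rows.append(i)
                coo_cols.append(j)
                coo_vals.append(v)
        else:
            for j, v in nonzeros:
                col_idx.append(j)
                csr_vals.append(v)
        # Rows sent to COO appear as empty rows in the CSR part.
        row_ptr.append(len(col_idx))
    return (coo_rows, coo_cols, coo_vals), (row_ptr, col_idx, csr_vals)


def spmv_hybrid(dense, x, threshold=2):
    """Compute y = A @ x using the COO and CSR parts together."""
    coo, csr = split_by_row_length(dense, threshold)
    y = [0.0] * len(dense)
    spmv_csr(*csr, x, y)
    spmv_coo(*coo, x, y)
    return y
```

Calling `spmv_hybrid(A, x)` on a small list-of-lists matrix `A` gives the same result as a dense matrix-vector product. Choosing `threshold` is exactly the kind of parameter that a parametric implementation would expose for tuning.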

Adaptive Tuning of Sparse Matrix-Vector Multiplication on Cell Architecture

Automatic Tuning of Sparse Matrix-Vector Multiplication on Multicore Clusters.

A Comprehensive Performance Model of Sparse Matrix-Vector Multiplication to Guide Kernel Optimization

Smat: An Input Adaptive Auto-Tuner For Sparse Matrix-Vector Multiplication

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

Performance Modeling and Optimization of Sparse Matrix-Vector Multiplication on NVIDIA CUDA Platform

Optimizing sparse matrix-vector multiplication based on gpu

Sparse Matrix-Vector Multiplication Optimizations based on Matrix Bandwidth Reduction using NVIDIA CUDA

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

Improvement of Sparse Matrix-Vector Multiplication on GPU

SMAT: An Input Adaptive Sparse Matrix-Vector Multiplication Auto-Tuner

Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture

Implementation and optimization of SpMV algorithm based on SW26010P many-core processor and stored in BCSR format

A lightweight optimization selection method for Sparse Matrix-Vector Multiplication

Efficient Algorithm Design of Optimizing SpMV on GPU.

Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi

An Optimized GP-GPU Warp Scheduling Algorithm for Sparse Matrix-Vector Multiplication

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication.

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Towards Efficient SpMV on Sunway Manycore Architectures.

Method for realizing heterogeneous many-core of sparse matrix-vector multiplication based on domestic SW26010 processors