Abstract: Sparse matrix-vector multiplication (SpMV) is one of the most important kernels for many applications. In this paper, we study the implementation of SpMV for scale-free matrices on many-core architectures, including graphics processing units and Xeon Phi coprocessors. We first propose a hardware-oblivious implementation for heterogeneous many-core processors using OpenCL. Our OpenCL implementation uses a novel SpMV format called hybrid COO+CSR (HCC), which employs 2-D jagged partitioning to balance the workload among a large number of cores and to improve data locality. Moreover, the OpenCL implementation is designed to be parametric, which allows systematic performance tuning. We conduct experiments to evaluate the efficiency of our hardware-oblivious implementation. The experiments show that it achieves performance comparable to Intel MKL and to the state-of-the-art OpenCL-based ViennaCL library. Although the OpenCL implementation provides functional portability across heterogeneous systems, it fails to take advantage of low-level architectural features. To further improve performance, we propose a hardware-conscious implementation using the native parallel programming language, with the Xeon Phi platform as a case study. In our hardware-conscious implementation, we ensure that the HCC format efficiently utilizes the vector processing units on Xeon Phi by employing low-level intrinsics, and we improve overall performance through locality-aware block mapping and intrablock tiling. Experiments using a wide range of representative scale-free matrices demonstrate that, compared with the OpenCL-based hardware-oblivious implementation, the hardware-conscious implementation achieves a 2.2x speedup on average. Compared with MKL, the hardware-conscious implementation achieves a 3.1x speedup on Xeon Phi.
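For readers unfamiliar with the two storage formats that HCC combines, the following is a minimal sequential sketch of SpMV over plain CSR and plain COO. It is not the paper's HCC format, and it does not show 2-D jagged partitioning, vectorization, or tiling. The function names (`spmv_csr`, `spmv_coo`) and the small example matrix are illustrative assumptions only.

```python
# Illustrative sketch only: textbook CSR and COO SpMV, not the HCC format.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x with A stored in CSR (compressed sparse row)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_coo(n_rows, entries, x):
    """Compute y = A @ x with A stored as COO (row, col, value) triples."""
    y = [0.0] * n_rows
    for r, c, v in entries:
        y[r] += v * x[c]
    return y


# Hypothetical 3x3 example matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0),
           (2, 0, 4.0), (2, 1, 5.0), (2, 2, 6.0)]
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_coo = spmv_coo(3, entries, x)
```

In CSR the work per row is proportional to that row's nonzero count, so the highly skewed row lengths of scale-free matrices lead to uneven work when rows are assigned to cores. COO has uniform per-entry work but stores an explicit row index for every nonzero. This trade-off is the general motivation for hybrid formats.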

Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs

Taking GPU Programming Models to Task for Performance Portability

Efficient sparse-matrix multi-vector product on GPUs

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

Evaluating performance portability of five shared-memory programming models using a high-order unstructured CFD solver

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Optimizing sparse matrix-vector multiplication based on GPU

A Study of Performance Portability in Plasma Physics Simulations

Feature-based SpMV Performance Analysis on Contemporary Devices

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Parallel optimization for sparse matrix-vector on GPU

A New Sparse Matrix Vector Multiplication GPU Algorithm Designed for Finite Element Problems

Performance Evaluations of Multiple GPUs based on MPI Environments

Implementing Performance Portability of High Performance Computing Programs in the New Golden Age of Chip Architecture

Improvement of Sparse Matrix-Vector Multiplication on GPU

Performance Modeling and Optimization of Sparse Matrix-Vector Multiplication on NVIDIA CUDA Platform

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Performance and Power Efficient Massive Parallel Computational Model for HPC Heterogeneous Exascale Systems

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms