Abstract: Sparse matrix-vector multiplication (SpMV) is an important primitive across a wide range of application domains such as scientific computing and graph analytics. Because SpMV is intrinsically memory-bound, its performance on throughput-oriented architectures such as GPUs is limited by the bandwidth between processors and memory. Processing-in-memory (PIM) architectures, made feasible by advances in 3D stacking, provide new opportunities to utilize ultra-high bandwidth by integrating compute logic into memory. In this paper, we develop an SpMV accelerator, named SpaceA, based on PIM architectures. SpaceA integrates compute logic near memory banks to exploit bank-level bandwidth. SpaceA combines hardware and data-mapping design features to alleviate the irregular memory access patterns that hinder full utilization of high memory bandwidth. On the hardware side, SpaceA has two distinctive features: (1) it uses outstanding memory requests to hide the latency of accesses to data located in non-local memory banks; (2) it integrates Content Addressable Memory (CAM) at the bank level to exploit data reuse of the input vectors. In addition, we develop a mapping scheme that partitions the sparse matrix across memory banks to maximize the data locality of the input vector and to balance the workload among the processing elements (PEs) near each bank. Overall, SpaceA together with the proposed mapping method achieves a 13.54x speedup and 87.49% energy saving on average over the GPU baseline on SpMV computation. In addition to SpMV primitives, we conduct a case study on graph analytics to demonstrate the benefits of SpaceA for applications built on SpMV. Compared to Tesseract and GraphP, state-of-the-art graph accelerators, SpaceA obtains better performance due to the higher effective bandwidth provided by near-bank integration.
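
To make the irregular access pattern and the load-balancing concern in the abstract concrete, the following is a minimal Python sketch. It assumes the matrix is stored in the common CSR format and uses a simple greedy rule that balances nonzeros across banks. The function names `spmv_csr` and `assign_rows_to_banks` are hypothetical, and the greedy rule is only an illustration: it is not the mapping scheme proposed in the paper, and it does not model input-vector locality, the CAM, or outstanding memory requests.

```python
from typing import List, Tuple


def spmv_csr(row_ptr: List[int], col_idx: List[int],
             values: List[float], x: List[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is indexed through col_idx, so reads of x are data-dependent
            # and irregular; this is the access pattern that is hard to serve
            # at full memory bandwidth.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def assign_rows_to_banks(row_ptr: List[int],
                         num_banks: int) -> Tuple[List[List[int]], List[int]]:
    """Illustrative greedy balancing (an assumption, not the paper's scheme):
    place rows, heaviest first, on the bank holding the fewest nonzeros."""
    nnz = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    load = [0] * num_banks
    banks: List[List[int]] = [[] for _ in range(num_banks)]
    for row in sorted(range(len(nnz)), key=lambda r: -nnz[r]):
        target = load.index(min(load))  # least-loaded bank, lowest index on ties
        banks[target].append(row)
        load[target] += nnz[row]
    return banks, load


# Example: the 4x4 matrix
#   [[1, 0, 2, 0],
#    [0, 3, 0, 0],
#    [4, 0, 5, 6],
#    [0, 0, 0, 7]]
row_ptr = [0, 2, 3, 6, 7]
col_idx = [0, 2, 1, 0, 2, 3, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 2.0, 3.0, 4.0]

y = spmv_csr(row_ptr, col_idx, values, x)
banks, load = assign_rows_to_banks(row_ptr, num_banks=2)
print(y)      # [7.0, 6.0, 43.0, 28.0]
print(banks)  # [[2, 3], [0, 1]]
print(load)   # [4, 3]
```

In this toy case each processing element would handle 4 or 3 nonzeros. A mapping with the paper's stated goals would additionally have to consider which entries of `x` each bank touches, which this sketch ignores.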

SPSA: Exploring Sparse-Packing Computation on Systolic Arrays from Scratch

SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication

Efficient Algorithm Design of Optimizing SpMV on GPU

CPSAA: Accelerating Sparse Attention using Crossbar-based Processing-In-Memory Architecture

SpaceA: Sparse Matrix Vector Multiplication on Processing-in-Memory Accelerator

MMSparse: 2D Partitioning of Sparse Matrix Based on Mathematical Morphology

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

LSRB-CSR: A Low Overhead Storage Format for SpMV on the GPU Systems

BafSP: Co-Design of Compute SRAM and Bit-Aware Data Flip Mitigation with In-Memory Sparsity Detection for SpMM

Accelerating Unstructured SpGEMM using Structured In-situ Computing

Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow

Adaptive SpMV/SpMSpV on GPUs for Input Vectors of Varied Sparsity

IOPS: An Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration

High Performance Unstructured SpMM Computation Using Tensor Cores

Reconfigurable Spatial-Parallel Stochastic Computing for Accelerating Sparse Convolutional Neural Networks