Abstract: Stochastic gradient descent (SGD) is widely used for training generalized linear models (GLMs), such as support vector machines and logistic regression, on large industry datasets. Such training consumes substantial computing power, and many accelerators have therefore been proposed to speed up GLM training. However, real-world datasets are often highly sparse. For example, YouTube's social network connectivity contains only 2.31% non-zero elements. It is not trivial to design an accelerator that can train efficiently on a sparse dataset stored in a compressed sparse format (e.g., the Compressed Sparse Row (CSR) format). The design of such an accelerator faces three challenges: (1) bank conflicts, which may occur when multiple processing engines in the accelerator access multiple memory banks; (2) complex interconnections, which are necessary to allow every processing engine to access any memory bank; and (3) high synchronization overhead, since each sample in a sparse dataset has a different number of non-zero elements and these elements are distributed differently, which makes it hard to overlap the gradient computation and model update of neighboring batches. To this end, we propose SparseACC, a sparsity-aware accelerator for training generalized linear models. SparseACC is based on two key mechanisms. First, a software/hardware co-design approach addresses the first two challenges by introducing a novel bank-conflict-free and bank-balanced CSR format. Second, a weight-aware ping-pong model addresses the third challenge, thereby maximizing the utilization of the processing engines. SparseACC leverages these two mechanisms to orchestrate training over sparse datasets, such that the training time decreases linearly with the sparsity of the dataset. We prototype SparseACC on a Xilinx Alveo U280 FPGA. The experimental evaluation shows that SparseACC converges up to 3.5×, 18×, 38×, and 110× faster than the state-of-the-art counterparts on a sparse accelerator, a Tesla V100 GPU, an Intel i9-10900k CPU, and a dense accelerator, respectively.
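To make the bank-conflict challenge concrete, the following Python sketch stores a small dataset in standard CSR and counts the memory cycles needed to fetch the model weights for one sample. It is only an illustration under stated assumptions, not SparseACC's bank-conflict-free format: the function names, the mapping of weight `j` to bank `j % num_banks`, the grouping of consecutive non-zeros across engines, and the one-request-per-bank-per-cycle rule are all hypothetical.

```python
from collections import Counter


def dense_to_csr(rows):
    """Convert a list of dense rows into standard CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # non-zero value
                col_idx.append(j)  # its column (feature) index
        row_ptr.append(len(values))  # end offset of this row
    return values, col_idx, row_ptr


def fetch_cycles(col_idx, row_ptr, sample, num_engines, num_banks):
    """Cycles needed to read the model weights touched by one sample.

    Assumptions (illustrative only): engines process consecutive non-zeros
    in groups of num_engines, weight j lives in bank j % num_banks, and a
    bank serves a single request per cycle.
    """
    cols = col_idx[row_ptr[sample]:row_ptr[sample + 1]]
    cycles = 0
    for start in range(0, len(cols), num_engines):
        group = cols[start:start + num_engines]
        per_bank = Counter(c % num_banks for c in group)
        # Requests to the same bank are serialized, so the busiest bank
        # determines how long this group takes.
        cycles += max(per_bank.values())
    return cycles


# Two samples with 8 features each.
rows = [
    [1, 0, 0, 0, 2, 0, 0, 0],  # columns 0 and 4 both map to bank 0
    [0, 3, 4, 0, 0, 0, 0, 5],  # columns 1, 2, 7 map to banks 1, 2, 3
]
values, col_idx, row_ptr = dense_to_csr(rows)
```

With 4 engines and 4 banks, the first sample needs two cycles for only two non-zeros because both map to the same bank, while the second sample fetches its three non-zeros in one cycle. A bank-balanced layout aims to avoid the first case.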


SparseACC: A Generalized Linear Model Accelerator for Sparse Datasets
