Abstract: Stochastic gradient descent (SGD) is widely used for training generalized linear models (GLMs), such as support vector machines and logistic regression, on large industry datasets. Such training consumes substantial computing power, so many accelerators have been proposed to speed up GLM training. However, real-world datasets are often highly sparse. For example, YouTube’s social network connectivity contains only 2.31% non-zero elements. Designing an accelerator that trains efficiently on a sparse dataset stored in a compressed sparse format (e.g., Compressed Sparse Row (CSR) format) is not trivial. The design of such an accelerator faces three challenges: (1) bank conflicts, which may happen when multiple processing engines in the accelerator access multiple memory banks; (2) complex interconnections, which are necessary to allow all processing engines to access any memory bank; and (3) high synchronization overhead, since each sample in a sparse dataset has a different number of non-zero elements with a different distribution, which makes it hard to overlap the gradient computation and model update of neighboring batches. To this end, we propose SparseACC, a sparsity-aware accelerator for training generalized linear models. SparseACC is based on two key mechanisms. First, a software/hardware co-design approach solves the first two design challenges by proposing a novel bank-conflict-free and bank-balanced CSR format. Second, a weight-aware ping-pong model solves the third challenge, thus maximizing the utilization of the processing engines. SparseACC leverages these two mechanisms to orchestrate training over sparse datasets, such that the training time decreases linearly with the sparsity of the dataset. We prototype SparseACC on a Xilinx Alveo U280 FPGA. The experimental evaluation shows that SparseACC converges up to 3.5×, 18×, 38×, and 110× faster than the state-of-the-art counterparts on a sparse accelerator, a Tesla V100 GPU, an Intel i9-10900k CPU, and a dense accelerator, respectively.
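To make the workload concrete, the following is a minimal software sketch of mini-batch SGD for logistic regression over data stored in the standard CSR layout. It is not the SparseACC design: the bank-conflict-free, bank-balanced CSR variant and the weight-aware ping-pong model are not reproduced here. The function names (`dense_to_csr`, `sgd_logistic_csr`, `predict`) and the hyperparameters are assumptions chosen for illustration only.

```python
import math

def dense_to_csr(rows):
    """Convert dense rows into standard CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(float(v))
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's non-zeros
    return values, col_idx, row_ptr

def sgd_logistic_csr(values, col_idx, row_ptr, labels, n_features,
                     lr=0.5, epochs=200, batch_size=2):
    """Mini-batch SGD for logistic regression that touches only non-zeros.

    Hypothetical helper; lr, epochs and batch_size are illustrative defaults.
    """
    w = [0.0] * n_features
    n_samples = len(row_ptr) - 1
    for _ in range(epochs):
        for start in range(0, n_samples, batch_size):
            stop = min(start + batch_size, n_samples)
            grad = {}  # sparse gradient: feature index -> accumulated value
            for i in range(start, stop):
                lo, hi = row_ptr[i], row_ptr[i + 1]
                # Sparse dot product: work is proportional to the row's non-zeros,
                # so rows with different sparsity take different amounts of time.
                z = sum(values[k] * w[col_idx[k]] for k in range(lo, hi))
                err = 1.0 / (1.0 + math.exp(-z)) - labels[i]
                for k in range(lo, hi):
                    j = col_idx[k]
                    grad[j] = grad.get(j, 0.0) + err * values[k]
            # The model update needs the whole batch's gradient first; this
            # dependency is what makes overlapping neighboring batches hard.
            for j, g in grad.items():
                w[j] -= lr * g / (stop - start)
    return w

def predict(w, values, col_idx, row_ptr, i):
    """Classify sample i as 1 if its sparse dot product with w is positive."""
    z = sum(values[k] * w[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
    return 1 if z > 0 else 0

# Tiny illustrative dataset (made up for this sketch, not from the paper).
X = [[1, 0, 0, 2],
     [0, 3, 0, 0],
     [0, 0, 1, 0],
     [2, 0, 0, 1]]
y = [1, 0, 0, 1]
vals, cols, ptr = dense_to_csr(X)
w = sgd_logistic_csr(vals, cols, ptr, y, n_features=4)
preds = [predict(w, vals, cols, ptr, i) for i in range(len(y))]
```

In this sketch the inner loops run once per non-zero rather than once per feature, which is the property that lets training time scale with sparsity. In hardware, the indexed accesses `w[col_idx[k]]` are the ones that can collide when several processing engines hit the same memory bank.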

Efficient SpMM Accelerator for Deep Learning: Sparkle and Its Automated Generator

SparseACC: A Generalized Linear Model Accelerator for Sparse Datasets

Algorithm/Hardware Co-Optimization for Sparsity-Aware SpMM Acceleration of GNNs

SDMA: an Efficient and Flexible Sparse-Dense Matrix-Multiplication Architecture for GNNs

Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow

FullSparse: A Sparse-Aware GEMM Accelerator with Online Sparsity Prediction

FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-art to Future Opportunities

Accelerating Unstructured SpGEMM using Structured In-situ Computing

BafSP: Co-Design of Compute SRAM and Bit-Aware Data Flip Mitigation with In-Memory Sparsity Detection for SpMM

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication

FireFly-S: Exploiting Dual-Side Sparsity for Spiking Neural Networks Acceleration with Reconfigurable Spatial Architecture

Spiker+: a framework for the generation of efficient Spiking Neural Networks FPGA accelerators for inference at the edge

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication

SPAT: FPGA-based Sparsity-Optimized Spiking Neural Network Training Accelerator with Temporal Parallel Dataflow

Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration

PULSE: Parametric Hardware Units for Low-power Sparsity-Aware Convolution Engine

SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

SpaceA: Sparse Matrix Vector Multiplication on Processing-in-Memory Accelerator

An Efficient Spiking Neural Network Accelerator with Sparse Weight