Abstract: The ever-growing parameter size and computation cost of Convolutional Neural Network (CNN) models hinder their deployment onto resource-constrained platforms. Network pruning techniques have been proposed to remove the redundancy in CNN parameters and produce a sparse model. Sparse-aware accelerators have also been proposed to reduce the computation cost and memory bandwidth requirements of inference by leveraging model sparsity. The irregularity of sparse patterns, however, limits the efficiency of those designs. Researchers have proposed to address this issue by creating a regular sparsity pattern through hardware-aware pruning algorithms. However, the pruning rate of these solutions is largely limited by the enforced sparsity patterns. This limitation motivates us to explore other compression methods beyond pruning. We find that kernel decomposition, with its two decoupled computation stages, can potentially take the processing of the sparse pattern off the critical path of inference and achieve a high compression ratio without enforcing a sparsity pattern. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map. We also propose a hybrid quantization scheme that exploits the different reuse frequency of each part of the decomposed weight. At the architecture level, ESCALATE proposes a novel ‘Basis-First’ dataflow and its corresponding microarchitecture design to maximize the benefits brought by the decomposed convolution. We evaluate ESCALATE with four representative CNN models on both the CIFAR-10 and ImageNet datasets and compare it against previous sparse accelerators and pruning algorithms. Results show that ESCALATE can achieve up to 325× and 11× compression ratios for models on CIFAR-10 and ImageNet, respectively. Compared with previous dense and sparse accelerators, the ESCALATE accelerator on average boosts energy efficiency by 8.3× and 3.77×, and reduces latency by 17.9× and 2.16×, respectively.
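
The two decoupled stages of a decomposed convolution can be illustrated with a small NumPy sketch. The abstract does not spell out the exact decomposition, so the form used here is an assumption: each 2D kernel is written as a linear combination of a few shared basis kernels (`basis`) weighted by possibly sparse coefficients (`coef`). All names, shapes, and the 50% sparsity level are hypothetical and chosen only for illustration; this is not the ESCALATE implementation.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D cross-correlation of one channel x (H, W) with kernel k (R, S)."""
    H, W = x.shape
    R, S = k.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(R):
        for j in range(S):
            out += k[i, j] * x[i:i + H - R + 1, j:j + W - S + 1]
    return out

def direct_conv(x, weight):
    """Reference dense convolution: x is (C, H, W), weight is (O, C, R, S)."""
    O, C = weight.shape[:2]
    return np.array([
        sum(conv2d_valid(x[c], weight[o, c]) for c in range(C))
        for o in range(O)
    ])

def decomposed_conv(x, basis, coef):
    """Two-stage form (assumed): basis is (M, R, S), coef is (O, C, M)."""
    # Stage 1: convolve every input channel with every shared basis kernel.
    # This stage is dense and regular; it does not depend on the sparsity of coef.
    z = np.array([[conv2d_valid(xc, b) for b in basis] for xc in x])  # (C, M, H', W')
    # Stage 2: weighted sum of the intermediate feature maps.
    # Zero coefficients contribute nothing, so sparsity only matters here.
    return np.einsum("ocm,cmhw->ohw", coef, z)

# Small hypothetical layer: 3 input channels, 4 output channels, 2 basis kernels.
rng = np.random.default_rng(0)
C, O, M, R = 3, 4, 2, 3
x = rng.standard_normal((C, 8, 8))
basis = rng.standard_normal((M, R, R))
coef = rng.standard_normal((O, C, M))
coef[rng.random(coef.shape) < 0.5] = 0.0  # unstructured sparsity in the coefficients

# Rebuild the dense weights from the decomposition and check both paths agree.
weight = np.einsum("ocm,mrs->ocrs", coef, basis)
y_direct = direct_conv(x, weight)
y_decomposed = decomposed_conv(x, basis, coef)
assert np.allclose(y_direct, y_decomposed)
```

Because convolution is linear, both paths produce the same output. Under this assumed form, the irregular sparse pattern appears only in the stage-2 accumulation, while stage 1 remains a regular dense computation over a small number of basis kernels.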

FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning

Efficient Processing of Sparse Tensor Decomposition via Unified Abstraction and PE-Interactive Architecture

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Swift: High-Performance Sparse Tensor Contraction for Scientific Applications

Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition

SparseACC: A Generalized Linear Model Accelerator for Sparse Datasets

TSTC: Two-Level Sparsity Tensor Core Enabling Both Algorithm Flexibility and Hardware Efficiency

Balancing memory-accessing and computing over sparse DNN accelerator via efficient data packaging

Extending Sparse Tensor Accelerators to Support Multiple Compression Formats

SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks

SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Towards General Purpose Acceleration by Exploiting Common Data-Dependence Forms

SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow

Work-in-Progress: A High-performance FPGA Accelerator for Sparse Neural Networks

ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition

Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow