Abstract: Tensor cores in modern GPUs significantly accelerate matrix multiplication, the primary operation in deep learning. However, existing hardware architectures handle the unstructured sparsity found in deep learning poorly, resulting in either algorithm inflexibility or hardware inefficiency. The previous tensor core architecture requires matrices to be pruned into the 2:4 sparse pattern, leading to algorithm inflexibility. Customized accelerators introduce extra architectural components (e.g., interconnection networks for dynamic data routing or buffers for avoiding data conflicts) to handle unstructured sparse matrices, leading to hardware inefficiency. To resolve this tension between algorithm flexibility and hardware efficiency, we propose the Two-level Sparsity Tensor Core (TSTC) in this paper. TSTC builds on the observation that unstructured sparsity, which enables algorithm flexibility, can be maintained at the coarse-grained level, while hardware efficiency, which requires structured sparsity, can be ensured at the fine-grained level. For algorithm flexibility, we propose the Flexible Sparse Block (FSB) pattern. FSB allows unstructured sparse matrices to be divided into fine-grained blocks with different structured sparsity. As a result, using FSB yields up to a 7.29x speedup compared with other formats. For hardware efficiency, we propose the Dynamic Extendible Reduction Network (DERN). DERN supports different structured sparse reductions by only extending the data width of the standard reduction network, without introducing interconnections or buffers. DERN enables TSTC to achieve 7.19x more energy savings at a similar speed. We also propose a complete flow that automatically deploys different sparse deep learning algorithms to TSTC. According to extensive experiments, TSTC achieves 1.24x~7.69x speedup and 3.68x~4.17x energy savings compared with the tensor core and the SOTA customized accelerator.
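
The two-level idea in the abstract can be illustrated with a small sketch: the matrix as a whole stays unstructured, while each fine-grained group of entries is described by its own N:M pattern. The code below is only an illustration under stated assumptions, not the FSB format itself: the group size of 4, the grouping along a single row, and the function names `nonzeros_per_group`, `satisfies_n_of_m`, and `per_group_patterns` are all hypothetical choices made for this example.

```python
def nonzeros_per_group(row, group=4):
    """Count nonzeros in each run of `group` consecutive entries of `row`."""
    if len(row) % group != 0:
        raise ValueError("row length must be a multiple of the group size")
    return [
        sum(1 for v in row[i:i + group] if v != 0)
        for i in range(0, len(row), group)
    ]


def satisfies_n_of_m(row, n=2, group=4):
    """True if every group has at most `n` nonzeros (e.g. the 2:4 pattern)."""
    return all(count <= n for count in nonzeros_per_group(row, group))


def per_group_patterns(row, group=4):
    """Label each group with its own N:M pattern (assumed illustration only)."""
    return [f"{count}:{group}" for count in nonzeros_per_group(row, group)]


# An unstructured sparse row: the nonzero density differs from group to group.
row = [0, 3, 0, 0,  1, 2, 5, 0,  0, 0, 0, 0,  7, 0, 0, 9]

counts = nonzeros_per_group(row)      # nonzeros in each group of 4
fits_2_4 = satisfies_n_of_m(row)      # does the whole row obey 2:4?
patterns = per_group_patterns(row)    # one structured pattern per group
```

In this example the row as a whole does not satisfy a fixed 2:4 constraint, because one group holds three nonzeros, yet every group can still be described by some structured pattern of its own. That is the sense in which unstructured sparsity at the coarse level can coexist with structured sparsity at the fine level.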

Tensor Core-Adapted Sparse Matrix Multiplication for Accelerating Sparse Deep Neural Networks

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

High Performance Unstructured SpMM Computation Using Tensor Cores

BCB-SpTC: An Efficient Sparse High-Dimensional Tensor Contraction Employing Tensor Core Acceleration

TSTC: Two-Level Sparsity Tensor Core Enabling Both Algorithm Flexibility and Hardware Efficiency

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

EC-SpMM: Efficient Compilation of SpMM Kernel on GPUs

SpMMPlu: A Compiler Plug-in with Sparse IR for Efficient Sparse Matrix Multiplication

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration

Optimizing sparse general matrix–matrix multiplication for DCUs

TaiChi: A Hybrid Compression Format for Binary Sparse Matrix-Vector Multiplication on GPU

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures

Accelerating approximate matrix multiplication for near-sparse matrices on GPUs

Scale-Free Sparse Matrix-Vector Multiplication on Many-Core Architectures

Sparse Matrix-Vector Multiplication Optimizations based on Matrix Bandwidth Reduction using NVIDIA CUDA

Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

SpWMM: A High-Performance Sparse-Winograd Matrix-Matrix Multiplication Accelerator for CNNs