Abstract: The tensor cores in modern GPUs deliver significant performance improvements in matrix multiplication, the primary operation in deep learning. However, existing hardware architectures struggle with the unstructured sparsity found in deep learning, resulting in algorithm inflexibility and hardware inefficiency. The previous tensor core architecture requires matrices to be pruned into 2:4 sparse patterns, leading to algorithm inflexibility. Customized accelerators introduce extra architectures (e.g., interconnection networks for dynamic data routing or buffers for avoiding data conflicts) to handle unstructured sparse matrices, leading to hardware inefficiency. To resolve the tension between algorithm flexibility and hardware efficiency, we propose the Two-level Sparsity Tensor Core (TSTC) in this paper. TSTC builds on the observation that unstructured sparsity, which enables algorithm flexibility, can be maintained at the coarse-grained level, while structured sparsity, which hardware efficiency requires, can be ensured at the fine-grained level. For algorithm flexibility, we propose the Flexible Sparse Block (FSB) pattern. FSB allows unstructured sparse matrices to be divided into fine-grained blocks with different structured sparsity. As a result, using FSB leads to a speedup of up to 7.29x compared with other formats. For hardware efficiency, we propose the Dynamic Extendible Reduction Network (DERN). DERN supports different structured sparse reductions by only extending the data width of the standard reduction network, without introducing interconnections or buffers. DERN enables TSTC to achieve 7.19x more energy savings at a similar speed. We also propose an end-to-end flow that automatically deploys different sparse deep learning algorithms to TSTC. In extensive experiments, TSTC achieves 1.24x to 7.69x speedup and 3.68x to 4.17x energy savings compared with the tensor core and the state-of-the-art (SOTA) customized accelerator.
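
The two-level idea can be made concrete with a small sketch. The abstract does not define FSB precisely, so everything below is an assumption made for illustration: the block shape (2 rows by 4 columns), the group length M = 4 (borrowed from the 2:4 pattern mentioned above), the rule "label each block with the smallest N such that every group of M elements has at most N nonzeros", and the helper names `group_nonzeros`, `block_pattern`, and `classify_blocks`. It is not the paper's format or algorithm; it only shows how an unstructured matrix might be viewed as blocks that each satisfy some N:M structured pattern.

```python
# Hypothetical illustration only. The block shape, the group length M, and the
# labelling rule are assumptions, not the paper's definition of FSB.

M = 4  # group length, as in the 2:4 pattern mentioned in the abstract


def group_nonzeros(row, m=M):
    """Count the nonzeros in each consecutive group of m elements of a row."""
    return [sum(1 for v in row[i:i + m] if v != 0) for i in range(0, len(row), m)]


def block_pattern(block, m=M):
    """Return the smallest n such that every group in the block is n:m sparse."""
    return max(max(group_nonzeros(row, m)) for row in block)


def classify_blocks(matrix, block_rows=2, block_cols=4, m=M):
    """Split a matrix into blocks and label each block with its n:m pattern."""
    labels = {}
    for r in range(0, len(matrix), block_rows):
        for c in range(0, len(matrix[0]), block_cols):
            block = [row[c:c + block_cols] for row in matrix[r:r + block_rows]]
            labels[(r // block_rows, c // block_cols)] = block_pattern(block, m)
    return labels


# An unstructured sparse matrix: nonzeros are not pruned to any fixed pattern.
matrix = [
    [1, 0, 0, 0, 3, 4, 0, 5],
    [0, 2, 0, 0, 0, 6, 7, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 2, 0, 2, 2],
]

labels = classify_blocks(matrix)
for position, n in sorted(labels.items()):
    print(f"block {position}: {n}:{M}")
```

In this toy matrix the four blocks come out with four different labels (0:4, 1:4, 3:4, and 4:4), so the matrix as a whole stays unstructured while each block individually has a fixed structured pattern. A single global 2:4 constraint would instead force pruning in the two denser blocks.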

STC: Significance-aware Transform-based Codec Framework for External Memory Access Reduction

TECO: A Unified Feature Map Compression Framework Based on Transform and Entropy

Spatial-Temporal Transformer based Video Compression Framework

Memory-Efficient Compression Based on Least-Squares Fitting in Convolutional Neural Network Accelerators

ASC: Adaptive Scale Feature Map Compression for Deep Neural Network

Memory-Efficient CNN Accelerator Based on Interlayer Feature Map Compression

TSTC: Two-Level Sparsity Tensor Core Enabling Both Algorithm Flexibility and Hardware Efficiency

An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression

A Computationally Efficient Neural Video Compression Accelerator Based on a Sparse CNN-Transformer Hybrid Network

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Relative Indexed Compressed Sparse Filter Encoding Format for Hardware-Oriented Acceleration of Deep Convolutional Neural Networks

DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures

QTTNet: Quantized Tensor Train Neural Networks for 3D Object and Video Recognition

Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A Neural Exploration via Resolution-Adaptive Learning

CSWAP: A Self-Tuning Compression Framework for Accelerating Tensor Swapping in GPUs

A Method to Reduce the Intra-Frame Prediction Complexity of HEVC Based on D-CNN

UACT: A Unified Energy-efficient Computing Architecture for CNN and TCNN

StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators

Focused Quantization for Sparse CNNs

A Streaming Accelerator for Deep Convolutional Neural Networks with Image and Feature Decomposition for Resource-limited System Applications

A SRAM-saving Two-Stage Storage Strategy for the Coefficients Memories in HEVC Encoders