Abstract: The tensor cores in modern GPUs deliver significant performance improvements in matrix multiplication, the primary operation in deep learning. However, existing hardware architectures struggle with the unstructured sparsity found in deep learning, resulting in algorithm inflexibility and hardware inefficiency. The previous tensor core architecture requires matrices to be pruned into 2:4 sparse patterns, leading to algorithm inflexibility. Customized accelerators introduce extra hardware structures (e.g., interconnection networks for dynamic data routing or buffers for avoiding data conflicts) to handle unstructured sparse matrices, leading to hardware inefficiency. To resolve this tension between algorithm flexibility and hardware efficiency, we propose the Two-level Sparsity Tensor Core (TSTC) in this paper. TSTC builds on the observation that unstructured sparsity, which enables algorithm flexibility, can be maintained at the coarse-grained level, while structured sparsity, which hardware efficiency requires, can be ensured at the fine-grained level. For algorithm flexibility, we propose the Flexible Sparse Block (FSB) pattern. FSB allows unstructured sparse matrices to be divided into fine-grained blocks with different structured sparsity. As a result, using FSB leads to up to 7.29x speedup compared with other formats. For hardware efficiency, we propose the Dynamic Extendible Reduction Network (DERN). DERN supports different structured sparse reductions by only extending the data width of the standard reduction network, without introducing interconnections or buffers. DERN enables TSTC to achieve 7.19x more energy savings at a similar speed. We also propose a complete flow that automatically deploys different sparse deep learning algorithms to TSTC. According to extensive experiments, TSTC achieves 1.24x~7.69x speedup and 3.68x~4.17x energy savings compared with the tensor core and the SOTA customized accelerator.
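The two-level idea in the abstract (unstructured sparsity across blocks, structured sparsity inside each block) can be illustrated with a small sketch. This is not the FSB format itself, whose details the abstract does not give. The block shape (2x4), the group size M = 4, the rule "label each block with the smallest N such that every group of M elements has at most N nonzeros", and the function names `block_pattern` and `classify_blocks` are all assumptions made for illustration.

```python
# Hypothetical illustration only: the block shape, group size m = 4, and the
# "smallest n:m that fits" rule are assumptions, not the FSB format itself.

def block_pattern(block, m=4):
    """Return the smallest n such that every group of m consecutive
    elements in each row of `block` has at most n nonzeros."""
    n = 0
    for row in block:
        for start in range(0, len(row), m):
            group = row[start:start + m]
            n = max(n, sum(1 for v in group if v != 0))
    return n


def classify_blocks(matrix, block_rows=2, block_cols=4, m=4):
    """Tile `matrix` into blocks and label each block with its n (for n:m)."""
    labels = []
    for r in range(0, len(matrix), block_rows):
        label_row = []
        for c in range(0, len(matrix[0]), block_cols):
            block = [row[c:c + block_cols] for row in matrix[r:r + block_rows]]
            label_row.append(block_pattern(block, m))
        labels.append(label_row)
    return labels


# An unstructured sparse matrix: nonzeros are not confined to a 2:4 pattern.
matrix = [
    [1, 0, 0, 0,  1, 2, 0, 3],
    [0, 0, 5, 0,  0, 4, 6, 0],
    [0, 0, 0, 0,  7, 0, 8, 0],
    [0, 0, 0, 0,  0, 9, 0, 1],
]
labels = classify_blocks(matrix)
print(labels)  # [[1, 3], [0, 2]]
```

In this toy matrix, the four blocks end up with 1:4, 3:4, 0:4, and 2:4 patterns. A fixed 2:4 scheme would have to prune the 3:4 block, whereas per-block labels keep every nonzero while each block remains internally structured.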

High-Performance Tensor Learning Primitives Using GPU Tensor Cores

High-Performance Tensor-Train Primitives Using GPU Tensor Cores

Scalable CP Decomposition for Tensor Learning using GPU Tensor Cores

cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores

cuFasterTucker: A Stochastic Optimization Strategy for Parallel Sparse FastTucker Decomposition on GPU Platform

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Hardware-Enabled Efficient Data Processing with Tensor-Train Decomposition

A-Tucker: Fast Input-Adaptive and Matricization-Free Tucker Decomposition of Higher-Order Tensors on GPUs

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

Hardware-Efficient Mixed-Precision CP Tensor Decomposition

Tucker Tensor Decomposition on FPGA

Accelerating Large Language Model Training with Hybrid GPU-based Compression

TensorCache: Reconstructing Memory Architecture with SRAM-Based In-Cache Computing for Efficient Tensor Computations in GPGPUs

a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs

TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs

TSTC: Two-Level Sparsity Tensor Core Enabling Both Algorithm Flexibility and Hardware Efficiency

Tensorized NeuroEvolution of Augmenting Topologies for GPU Acceleration

Speeding Up Deep Convolutional Neural Networks Based on Tucker-CP Decomposition

Sparse Tucker Tensor Decomposition on a Hybrid FPGA-CPU Platform

GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor Cores

Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition