Abstract:<p>Embedded devices are common platforms for deploying inference networks, relying on customized accelerators to deliver the required performance under strict resource constraints. During DNN inference, the sparsity in the activations and weights of every layer causes a large number of ineffectual memory accesses and computing operations. Data compression is adopted in accelerator design as a data pruning method that eliminates zero-valued data through a specific data packaging method. However, data compression breaks, to varying degrees, the data regularity that the processing array of a DNN accelerator relies on. The more complex data access caused by this irregular data organization requires extra control logic and decoding logic to compensate.</p><p>An accelerator architecture that supports sparsity can combine an efficient data packaging method with sophisticated memory access schemes and a parallel on-chip decoder structure to balance memory access and computation. In this paper, we propose a flexible and highly parallel accelerator architecture that exploits the sparsity in DNNs to achieve high performance with low energy consumption. It uses parallel optimization and a quantitative data packaging method that is efficient and stable across different degrees of sparsity. The total DRAM accesses, performance, and energy consumption of the proposed sparse architecture are evaluated with different inference networks. Experiments show that the DRAM accesses of the proposed data packaging method are significantly lower than those of other commonly used sparse data compression storage methods. With the optimization method proposed in this paper, the sparse accelerator architecture improves performance and energy savings by up to 1.2x and 1.6x, respectively, over a comparably provisioned accelerator that does not support sparsity. In addition, the proposed accelerator architecture achieves energy efficiency and performance improvements of up to 1.70x and 1.56x, respectively, compared with state-of-the-art architectures.</p>
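<p>To make the idea of eliminating zero-valued data concrete, the following is a minimal sketch of a generic bitmap-based packing scheme. It is not the quantitative data packaging method proposed in the paper, whose details the abstract does not give. The function names <code>pack_bitmap</code> and <code>unpack_bitmap</code> and the sample activation values are hypothetical and chosen only for illustration.</p>

```python
# Illustrative sketch only: a generic bitmap encoding for sparse data.
# This is NOT the paper's packaging method; all names here are hypothetical.

def pack_bitmap(values):
    """Split a flat list into a 0/1 occupancy bitmap and its nonzero values."""
    bitmap = [1 if v != 0 else 0 for v in values]
    nonzeros = [v for v in values if v != 0]
    return bitmap, nonzeros


def unpack_bitmap(bitmap, nonzeros):
    """Rebuild the dense list; this stands in for the decoding logic."""
    remaining = iter(nonzeros)
    return [next(remaining) if bit else 0 for bit in bitmap]


# A small, made-up activation vector with 5 zeros out of 8 entries.
activations = [0, 3, 0, 0, 7, 0, 1, 0]
bitmap, nonzeros = pack_bitmap(activations)
restored = unpack_bitmap(bitmap, nonzeros)
```

<p>In this sketch only three values plus an 8-bit map would need to be stored instead of eight values, but the position of each nonzero value is no longer implied by its index. Recovering those positions is the kind of extra decoding work the abstract refers to.</p>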

Efficient Processing of Sparse Tensor Decomposition via Unified Abstraction and PE-Interactive Architecture

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

TSTC: Two-Level Sparsity Tensor Core Enabling Both Algorithm Flexibility and Hardware Efficiency

Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition

FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

Hardware-Enabled Efficient Data Processing with Tensor-Train Decomposition

Balancing memory-accessing and computing over sparse DNN accelerator via efficient data packaging

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures

Swtensor: Accelerating Tensor Decomposition on Sunway Architecture

Sparse Tucker Tensor Decomposition on a Hybrid FPGA-CPU Platform

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU

Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition

Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction

BCB-SpTC: An Efficient Sparse High-Dimensional Tensor Contraction Employing Tensor Core Acceleration

EN-T: Optimizing Tensor Computing Engines Performance via Encoder-Based Methodology

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs