Abstract: When deploying deep neural networks (DNNs), training on edge devices is a practical way to adapt models to user-specific scenarios while avoiding privacy disclosure. However, the training computation is prohibitive for edge devices. This has brought sparse DNN training (SDT), which reduces training computation through dynamic weight pruning, into the limelight. SDT generally follows one of two strategies, distinguished by pruning granularity: structured or unstructured. Unfortunately, both suffer from limited training efficiency because of the gap between pruning granularity and hardware implementation. Structured pruning is hardware-friendly but has a low pruning ratio, which limits the computation reduction. Unstructured pruning has a high pruning ratio, but its unbalanced workload lowers utilization and its irregular sparsity distribution causes considerable sparsity-processing overhead. This paper proposes a software-hardware co-design that bridges this gap to improve the efficiency of SDT. On the algorithm side, a sub-structured pruning method, achieved with hybrid shape-wise and line-wise pruning, generates a high sparsity ratio while remaining hardware-friendly. On the hardware side, a sub-structured weight processing unit (SWPU) handles the hybrid sparsity effectively with three techniques. First, SWPU dynamically reorders the computation sequence with Hamming-distance-based clustering, balancing the irregular workload. Second, SWPU performs runtime scheduling that exploits the features of sub-structured sparse convolution through a detect-before-load controller, which skips redundant memory accesses and sparsity processing. Third, SWPU performs sparse convolution by compressing operands with spatial disconnect log-based routing and recovering their locations with bi-directional switching, which avoids power-hungry routing logic. Synthesized in 28 nm CMOS technology, SWPU operates at supply voltages from 0.56 V to 1.0 V with a maximum frequency of 675 MHz. It achieves a 50.1% higher pruning ratio than structured pruning and $1.53\times$ higher energy efficiency than unstructured pruning. The peak energy efficiency of SWPU is 126.04 TFLOPS/W, outperforming the state-of-the-art training processor by $1.67\times$. When training a ResNet-18 model, SWPU reduces energy by $3.72\times$ and offers a $4.69\times$ speedup over previous sparse training processors.
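The idea behind Hamming-distance-based reordering can be illustrated with a small software sketch. This is not the SWPU hardware algorithm, whose details the abstract does not give. It is a minimal greedy grouping under stated assumptions: each weight line is represented by a binary sparsity mask (1 = kept, 0 = pruned), and lines with similar masks are scheduled together so that units working in parallel see similar nonzero patterns. The names `hamming`, `cluster_order`, and `group_size` are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence

Mask = Sequence[int]  # assumption: 1 = weight kept, 0 = weight pruned


def hamming(a: Mask, b: Mask) -> int:
    """Number of positions where two equal-length masks differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_order(masks: List[Mask], group_size: int) -> List[List[int]]:
    """Greedily group mask indices so that each group holds similar masks.

    Each group is seeded with the densest remaining mask; the remaining
    mask with the smallest Hamming distance to the seed is then added
    until the group is full. Ties are broken by the lower index.
    """
    # Densest masks first, so the heaviest lines are placed early.
    remaining = sorted(range(len(masks)), key=lambda i: (-sum(masks[i]), i))
    groups: List[List[int]] = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        while remaining and len(group) < group_size:
            nearest = min(
                remaining,
                key=lambda i: (hamming(masks[seed], masks[i]), i),
            )
            remaining.remove(nearest)
            group.append(nearest)
        groups.append(group)
    return groups


# Toy example: four weight lines with hypothetical sparsity masks.
masks = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
groups = cluster_order(masks, group_size=2)
print(groups)  # [[2, 0], [1, 3]]
```

In this toy case, lines 2 and 0 differ in one position and lines 1 and 3 differ in one position, so each pair is scheduled together. A hardware implementation would likely use a cheaper, streaming-friendly heuristic than this quadratic search.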

SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor with Dynamic Sub-Structured Weight Pruning
