Abstract: When deploying deep neural networks (DNNs), training on edge devices is a practical way to adapt models to user-specific scenarios without disclosing private data. However, the computation that training requires is prohibitive for edge devices. This has brought sparse DNN training (SDT), which reduces training computation through dynamic weight pruning, into the limelight. SDT generally follows one of two strategies, distinguished by pruning granularity: structured or unstructured. Unfortunately, both suffer from limited training efficiency because of the gap between pruning granularity and hardware implementation. Structured pruning is hardware-friendly but has a low pruning ratio, which limits the computation reduction. Unstructured pruning has a high pruning ratio, but its unbalanced workload lowers utilization and its irregular sparsity distribution incurs considerable sparsity-processing overhead. This paper proposes a software-hardware co-design that bridges this gap to improve the efficiency of SDT. On the algorithm side, a sub-structured pruning method, achieved with hybrid shape-wise and line-wise pruning, produces a high sparsity ratio while remaining hardware-friendly. On the hardware side, a sub-structured weight processing unit (SWPU) handles the hybrid sparsity efficiently with three techniques. First, SWPU dynamically reorders the computation sequence with Hamming-distance-based clustering, balancing the irregular workload. Second, SWPU performs runtime scheduling through a detect-before-load controller that exploits the properties of sub-structured sparse convolution to skip redundant memory accesses and sparsity processing. Third, SWPU performs sparse convolution by compressing operands with spatial disconnect log-based routing and recovering their locations with bi-directional switching, avoiding power-hungry routing logic. Synthesized in 28nm CMOS technology, SWPU operates at supply voltages from 0.56 V to 1.0 V with a maximum frequency of 675 MHz. It achieves a 50.1% higher pruning ratio than structured pruning and $1.53\times$ higher energy efficiency than unstructured pruning. The peak energy efficiency of SWPU is 126.04 TFLOPS/W, outperforming the state-of-the-art training processor by $1.67\times$. When training a ResNet-18 model, SWPU reduces energy by $3.72\times$ and offers a $4.69\times$ speedup over previous sparse training processors.
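
To give a rough feel for what Hamming-distance-based reordering means, the sketch below orders a handful of sparsity masks so that consecutive masks differ in as few positions as possible. It is only an illustration: the abstract does not specify SWPU's clustering algorithm, so the greedy nearest-neighbour strategy, the function names `hamming` and `reorder_by_hamming`, and the 8-bit example masks are all assumptions made here.

```python
from __future__ import annotations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two sparsity masks differ."""
    return bin(a ^ b).count("1")


def reorder_by_hamming(masks: list[int]) -> list[int]:
    """Greedy nearest-neighbour ordering (an assumed stand-in, not SWPU's method).

    Starts at index 0, then repeatedly appends the unvisited mask that is
    closest in Hamming distance to the most recently chosen one.
    Returns the visiting order as indices into `masks`.
    """
    if not masks:
        return []
    order = [0]
    remaining = set(range(1, len(masks)))
    while remaining:
        last = masks[order[-1]]
        # Ties are broken by the lower index so the result is deterministic.
        nxt = min(remaining, key=lambda i: (hamming(masks[i], last), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical 8-bit masks (1 = weight kept, 0 = weight pruned).
masks = [0b11110000, 0b00001111, 0b11100000, 0b00000111]
order = reorder_by_hamming(masks)
print(order)
```

With these example masks, the two patterns concentrated in the upper bits end up adjacent, as do the two concentrated in the lower bits, which is the kind of grouping a similarity-based reordering aims for.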

PL-NPU: an Energy-Efficient Edge-Device DNN Training Processor with Posit-Based Logarithm-Domain Computing

A 28nm 276.55TFLOPS/W Sparse Deep-Neural-Network Training Processor with Implicit Redundancy Speculation and Batch Normalization Reformulation

A 3.89-GOPS/mW Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

An Energy-Efficient Deep Belief Network Processor Based on Heterogeneous Multi-Core Architecture With Transposable Memory and On-Chip Learning

Trainer: an Energy-Efficient Edge-Device Training Processor Supporting Dynamic Weight Pruning

A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge

LPE: Logarithm Posit Processing Element for Energy-Efficient Edge-Device Training

A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor with Dynamic Sub-Structured Weight Pruning

A High Energy-Efficiency Multi-core Neuromorphic Architecture for Deep SNN Training

CPE: an Energy-Efficient Edge-Device Training with Multi-dimensional Compression Mechanism

An Energy-Efficient Reconfigurable Processor for Binary- and Ternary-Weight Neural Networks With Flexible Data Bit Width

ANP-I: A 28-nm 1.5-pJ/SOP Asynchronous Spiking Neural Network Processor Enabling Sub-0.1-$\mu$J/Sample On-Chip Learning for Edge-AI Applications

A 0.99-to-4.38 μJ/class Event-Driven Hybrid Neural Network Processor for Full-Spectrum Neural Signal Analyses

iFPNA: A Flexible and Efficient Deep Learning Processor in 28-nm CMOS Using a Domain-Specific Instruction Set and Reconfigurable Fabric.

An Ultra-High Energy-Efficient Reconfigurable Processor for Deep Neural Networks with Binary/Ternary Weights in 28nm CMOS

7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm² and 6T HBST-TRAM-Based 2D Data-Reuse Architecture

22.6 ANP-I: A 28nm 1.5pJ/SOP Asynchronous Spiking Neural Network Processor Enabling Sub-0.1 μJ/Sample On-Chip Learning for Edge-AI Applications