Abstract: Model compression is an important technique to facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), and a number of prior works are dedicated to it. The target is to simultaneously reduce the model storage size and accelerate the computation, with minor effect on accuracy. Two important categories of DNN model compression techniques are weight pruning and weight quantization. The former leverages the redundancy in the number of weights, whereas the latter leverages the redundancy in the bit representation of weights. These two sources of redundancy can be combined, thereby leading to a higher degree of DNN model compression. However, a systematic framework of joint weight pruning and quantization of DNNs is lacking, which limits the achievable model compression ratio. Moreover, computation reduction, energy efficiency improvement, and the hardware performance overhead resulting from weight pruning need to be accounted for, beyond model size reduction alone. To address these limitations, we present ADMM-NN, the first algorithm-hardware co-optimization framework of DNNs using Alternating Direction Method of Multipliers (ADMM), a powerful technique to solve non-convex optimization problems with possibly combinatorial constraints. The first part of ADMM-NN is a systematic, joint framework of DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique with the regularization target dynamically updated in each ADMM iteration, thereby resulting in higher performance in model compression than the state of the art. The second part is hardware-aware DNN optimizations to facilitate hardware-level implementations.
We perform ADMM-based weight pruning and quantization considering (i) the computation reduction and energy efficiency improvement, and (ii) the hardware performance overhead due to irregular sparsity. The first requirement prioritizes the compression of convolutional layers over fully-connected layers, while the second requires the concept of a break-even pruning ratio, defined as the minimum pruning ratio of a specific layer that results in no hardware performance degradation. Without accuracy loss, ADMM-NN achieves 85× and 24× pruning on LeNet-5 and AlexNet models, respectively, significantly higher than the state of the art. The improvements become more significant when focusing on computation reduction. Combining weight pruning and quantization, we achieve 1,910× and 231× reductions in overall model size on these two benchmarks, when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50. We release code and models at https://github.com/yeshaokai/admm-nn.
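
To make the "dynamically updated regularization target" concrete, the following is a minimal sketch of ADMM-based pruning on a toy least-squares problem, not the released ADMM-NN implementation. It assumes the standard scaled-form ADMM splitting W = Z, where Z is constrained to have at most k nonzeros; the function names `project_sparse` and `admm_prune`, the penalty `rho`, and the closed-form W-update are illustrative assumptions (for a DNN, the W-update would instead be a few epochs of gradient-based training on the loss plus the quadratic term).

```python
import numpy as np

def project_sparse(v, k):
    """Euclidean projection onto vectors with at most k nonzeros:
    keep the k largest-magnitude entries and zero out the rest."""
    z = np.zeros_like(v)
    if k > 0:
        keep = np.argsort(np.abs(v))[-k:]
        z[keep] = v[keep]
    return z

def admm_prune(X, y, k, rho=1.0, iters=50):
    """Toy ADMM pruning for min 0.5*||X w - y||^2 s.t. w has <= k nonzeros.
    Hypothetical helper for illustration only."""
    n = X.shape[1]
    Z = np.zeros(n)              # auxiliary variable (the sparse copy of W)
    U = np.zeros(n)              # scaled dual variable
    A = X.T @ X + rho * np.eye(n)
    Xty = X.T @ y
    for _ in range(iters):
        # W-update: loss plus (rho/2)*||W - (Z - U)||^2, i.e. a quadratic
        # regularizer whose target (Z - U) changes every iteration.
        W = np.linalg.solve(A, Xty + rho * (Z - U))
        # Z-update: project onto the sparsity constraint set.
        Z = project_sparse(W + U, k)
        # Dual update: accumulate the remaining disagreement between W and Z.
        U = U + W - Z
    return Z                     # Z satisfies the sparsity constraint exactly

# Synthetic data with a 3-sparse ground-truth weight vector (an assumption).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[[1, 5, 12]] = [2.0, -3.0, 1.5]
y = X @ w_true

w_pruned = admm_prune(X, y, k=3)
print(np.count_nonzero(w_pruned), "nonzero weights out of", w_pruned.size)
```

Weight quantization fits the same template under this assumption: only the projection changes, mapping each entry of W + U to its nearest allowed quantization level instead of keeping the top-k magnitudes.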

An Ultra-Efficient Memristor-Based DNN Framework with Structured Weight Pruning and Quantization Using ADMM

Tiny but Accurate: A Pruned, Quantized and Optimized Memristor Crossbar Framework for Ultra Efficient DNN Implementation

Reliable Memristor-based Neuromorphic Design Using Variation- and Defect-Aware Training

Aging Aware Retraining for Memristor-based Neuromorphic Computing

ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers

ADMM-NN

A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMM

Progressive DNN Compression: A Key to Achieve Ultra-High Weight Pruning and Quantization Rates using ADMM

Pruning and quantization algorithm with applications in memristor-based convolutional neural network

Bulk-Switching Memristor-Based Compute-In-Memory Module for Deep Neural Network Training

Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression

Memristive Quantized Neural Networks: A Novel Approach to Accelerate Deep Learning On-Chip

Crossbar-Aligned & Integer-Only Neural Network Compression for Efficient In-Memory Acceleration

QuantBayes: Weight Optimization for Memristive Neural Networks via Quantization-Aware Bayesian Inference

A Memristor-Based Processing-in-Memory Architecture for Deep Convolutional Neural Networks Approximate Computation

BETTER: Bayesian-Based Training and Lightweight Transfer Architecture for Reliable and High-Speed Memristor Neural Network Deployment

Improving DNN Fault Tolerance Using Weight Pruning and Differential Crossbar Mapping for ReRAM-based Edge AI

ResNet Can Be Pruned 60x: Introducing Network Purification and Unused Path Removal (P-RM) after Weight Pruning

Reweighted Alternating Direction Method of Multipliers for DNN weight pruning

Reduction 93.7% time and power consumption using a memristor-based imprecise gradient update algorithm

StructADMM: A Systematic, High-Efficiency Framework of Structured Weight Pruning for DNNs