Abstract: Deep learning models have evolved into powerful tools for many artificial intelligence tasks. However, deploying deep neural networks in real-world applications remains challenging because of their high computational complexity and storage overhead. Fortunately, neural network compression can convert a densely connected network into a sparsely connected one with low resource demand. Because deep neural networks are complicated, a compression mechanism must find a tradeoff between compression ratio and model accuracy. In this article, by analyzing the statistics of channel connections, we propose an interactive neural network compression mechanism that combines out-in-channel pruning with neural network quantization. Many channel pruning works apply structured sparsity regularization to each layer separately. In contrast, we consider correlations between successive layers to retain the predictive power of the compact network. A global greedy pruning algorithm is designed to remove redundant out-in-channels iteratively. Moreover, to address the shortcomings of one-shot quantization, we propose an incremental quantization algorithm along the output-channel dimension, which smooths network fluctuations and recovers accuracy better during retraining. Our mechanism is comprehensively evaluated with various convolutional neural network (CNN) architectures on popular datasets. Notably, on ImageNet-1K, out-in-channel pruning reduces FLOPs by 54.0 percent on AlexNet and by 50.0 percent on ResNet-50, with top-1 accuracy drops of only 0.15 and 0.37 percent, respectively. On classification and style transfer tasks, the advantage of incremental quantization grows as the number of quantization bits decreases.
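
The idea of quantizing output channels in stages, rather than all at once, can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the article: the symmetric uniform quantizer, the rule of quantizing channels with the largest L1 norm first, the `fractions` schedule, and the hypothetical `retrain` callback that stands in for fine-tuning the channels that are still in full precision.

```python
import math


def quantize_uniform(values, bits):
    """Symmetric uniform quantization of a list of floats (assumed quantizer)."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1          # e.g. 127 positive levels for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def incremental_quantize(weights, bits, fractions, retrain=None):
    """Quantize output channels group by group.

    weights   : list of output channels, each a flat list of floats
    fractions : increasing cumulative fractions of channels to have
                quantized after each step, e.g. [0.5, 0.75, 1.0]
    retrain   : hypothetical callback(weights, frozen) that would fine-tune
                only the channels whose index is NOT in `frozen`
    """
    weights = [list(ch) for ch in weights]    # work on a copy
    n = len(weights)
    # Assumed ordering: channels with larger L1 norm are quantized first.
    order = sorted(range(n), key=lambda i: -sum(abs(w) for w in weights[i]))
    frozen = set()
    for frac in fractions:
        target = min(n, math.ceil(frac * n))
        for i in order[len(frozen):target]:
            weights[i] = quantize_uniform(weights[i], bits)
            frozen.add(i)
        if retrain is not None:
            retrain(weights, frozen)          # remaining channels absorb the error
    return weights, frozen


if __name__ == "__main__":
    w = [[0.5, -1.0], [0.1, 0.2], [3.0, 1.5], [0.0, 0.0]]
    quantized, frozen = incremental_quantize(w, bits=8, fractions=[0.5, 1.0])
    print(sorted(frozen))
```

In a one-shot scheme, `fractions` would be `[1.0]`, so every channel is quantized before any retraining happens. With a staged schedule, the channels left in full precision can be retrained after each step to compensate for the error introduced so far.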

Adaptive Layerwise Quantization for Deep Neural Network Compression

Deep Neural Network Compression With Single and Multiple Level Quantization

Improved Model Compression Method Based on Information Entropy

Weight Normalization based Quantization for Deep Neural Network Compression

Instance-Aware Dynamic Neural Network Quantization

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Weighted-Entropy-Based Quantization for Deep Neural Networks

Channel-Level Variable Quantization Network for Deep Image Compression

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

Quantization without Tears

Space Efficient Quantization for Deep Convolutional Neural Networks

Focused Quantization for Sparse CNNs

Deep Network Quantization via Error Compensation

An Inter-Layer Weight Prediction and Quantization for Deep Neural Networks based on a Smoothly Varying Weight Hypothesis

Residual Quantization for Low Bit-Width Neural Networks

Joint Optimization of Dimension Reduction and Mixed-Precision Quantization for Activation Compression of Neural Networks

Efficient Neural Compression with Inference-time Decoding

Weightless: Lossy Weight Encoding For Deep Neural Network Compression

Learning Low Resource Consumption CNN through Pruning and Quantization

Towards the Limit of Network Quantization

"Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach