Abstract: Convolutional neural networks (CNNs) have achieved remarkable success in various computer vision tasks, drawing on tens of millions of parameters to learn from massive training data. However, CNNs often incur significant memory and computation costs, which prohibits their use in resource-limited environments such as mobile or embedded devices. To address these issues, existing approaches typically focus on either accelerating the convolutional layers or compressing the fully-connected layers separately, without pursuing a joint optimum. In this paper, we overcome this limitation by introducing a holistic CNN compression framework, termed LRDKT, which works across both convolutional and fully-connected layers. First, a low-rank decomposition (LRD) scheme is proposed to remove redundancies in both convolutional kernels and fully-connected matrices; it has a novel closed-form solver that is significantly more efficient than existing iterative optimization solvers. Second, a novel knowledge transfer (KT) based training scheme is introduced. To recover the accumulated accuracy loss and overcome the vanishing gradient problem, KT explicitly aligns the outputs and intermediate responses of the student (compressed) network with those of its teacher (original) network. We have comprehensively analyzed and evaluated the compression and speedup ratios of the proposed model on the MNIST and ILSVRC 2012 benchmarks. On both benchmarks, the proposed scheme demonstrates superior performance over state-of-the-art methods. We also apply the proposed compression scheme to transfer learning tasks, including domain adaptation and object detection, where it likewise shows performance gains over state-of-the-art methods. Our source code and compressed models are available at https://github.com/ShaohuiLin/LRDKT.
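To make the idea of low-rank compression concrete, the following is a minimal sketch that factorizes a fully-connected weight matrix with a truncated SVD, which is itself a closed-form (non-iterative) solution. It is not the LRD solver proposed in the paper: the function names `low_rank_factorize` and `compression_ratio`, the layer sizes, and the chosen rank are all assumptions made for illustration.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Return A (m x rank) and B (rank x n) with A @ B the best rank-`rank`
    approximation of W in Frobenius norm, computed by truncated SVD.
    This is a generic stand-in, not the paper's LRD solver."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # fold the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

def compression_ratio(m, n, rank):
    """Parameters of the dense m x n layer divided by those of the two factors."""
    return (m * n) / (rank * (m + n))

rng = np.random.default_rng(0)

# Hypothetical layer: a 256 x 512 weight matrix that is exactly rank 8,
# so the rank-8 factorization should reproduce it up to rounding error.
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 512))
A, B = low_rank_factorize(W, rank=8)

x = rng.standard_normal(512)
y_full = W @ x        # original layer: 256 * 512 multiplications
y_low = A @ (B @ x)   # factorized layer: 8 * (256 + 512) multiplications

rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
ratio = compression_ratio(256, 512, 8)
print(f"relative error: {rel_err:.2e}, compression ratio: {ratio:.1f}x")
```

Replacing one dense layer with two thinner ones reduces both storage and multiply-accumulate operations by the same factor. Real trained weights are only approximately low-rank, so the truncation introduces an error that accumulates across layers; that is the accuracy loss the abstract's knowledge transfer step is meant to recover.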

CNN Acceleration by Low-rank Approximation with Quantized Factors

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Focused Quantization for Sparse CNNs

Speeding-up and compression convolutional neural networks by low-rank decomposition without fine-tuning

Holistic CNN Compression Via Low-Rank Decomposition with Knowledge Transfer

Sensitivity-based Acceleration and Compression Algorithm for Convolution Neural Network

Transform Quantization for CNN (Convolutional Neural Network) Compression

Convolutional neural networks compression with low rank and sparse tensor decompositions

Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms

Low-precision CNN Model Quantization based on Optimal Scaling Factor Estimation

Deep neural network compression by Tucker decomposition with nonlinear response

Compressing CNN Kernels for Videos Using Tucker Decompositions: Towards Lightweight CNN Applications

Joint Optimization of Dimension Reduction and Mixed-Precision Quantization for Activation Compression of Neural Networks

Quantized Convolutional Neural Networks for Mobile Devices

Compressing Deep Convolutional Networks using Vector Quantization

A Survey of Model Compression and Acceleration for Deep Neural Networks

Post-Training Non-Uniform Quantization for Convolutional Neural Networks

Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Reduced storage direct tensor ring decomposition for convolutional neural networks compression

Recent Advances in Convolutional Neural Network Acceleration