Abstract: Deep convolutional neural networks have a large number of parameters and require a significant number of floating-point operations during computation, which limits their deployment where storage space is limited and computational resources are insufficient, such as on mobile phones and small robots. Many network compression methods have been proposed to address these issues, including pruning, low-rank decomposition, and quantization. However, these methods typically fail to achieve a significant compression ratio in terms of the parameter count. Even when high compression ratios are achieved, the network's performance often deteriorates significantly, making it difficult to perform tasks effectively. In this study, we propose a more compact representation for neural networks, named Quantized Low-Rank Tensor Decomposition (QLTD), to super compress deep convolutional neural networks. First, we employed low-rank Tucker decomposition to compress the pre-trained weights. Subsequently, to further exploit redundancies within the core tensor and factor matrices obtained through Tucker decomposition, we employed vector quantization to partition and cluster the weights. Simultaneously, we introduced a self-attention module for each core tensor and factor matrix to enhance the training responsiveness in critical regions. Object identification experiments on CIFAR-10 showed that QLTD achieved a compression ratio of 35.43× with less than 1% loss in accuracy, and a compression ratio of 90.61× with less than 2% loss in accuracy. QLTD achieved a significant compression ratio in terms of the parameter count and struck a good balance between compressing parameters and maintaining identification accuracy.
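The Tucker step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a Tucker-2 decomposition that factors only the output- and input-channel modes of a conv kernel of shape (C_out, C_in, kh, kw), computed by truncated higher-order SVD (HOSVD). The abstract does not state which Tucker variant, rank-selection rule, or fine-tuning schedule QLTD uses, so the helper names `unfold`, `tucker2_hosvd`, and `tucker2_reconstruct` and the ranks below are hypothetical, and random weights stand in for a pre-trained layer.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def tucker2_hosvd(weight, rank_out, rank_in):
    """Truncated HOSVD on the two channel modes of a conv kernel.

    weight has shape (C_out, C_in, kh, kw). Returns the core tensor of shape
    (rank_out, rank_in, kh, kw) and factor matrices of shapes
    (C_out, rank_out) and (C_in, rank_in).
    """
    # Leading left singular vectors of each channel-mode unfolding.
    U_out = np.linalg.svd(unfold(weight, 0), full_matrices=False)[0][:, :rank_out]
    U_in = np.linalg.svd(unfold(weight, 1), full_matrices=False)[0][:, :rank_in]
    # Project the kernel onto both subspaces: core = W x_1 U_out^T x_2 U_in^T.
    core = np.einsum("oihw,or,is->rshw", weight, U_out, U_in)
    return core, U_out, U_in


def tucker2_reconstruct(core, U_out, U_in):
    """Rebuild the full kernel: W_hat = core x_1 U_out x_2 U_in."""
    return np.einsum("rshw,or,is->oihw", core, U_out, U_in)


# Random stand-in (hypothetical) for a pre-trained 3x3 conv layer, 32 in / 64 out channels.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32, 3, 3))
core, U_out, U_in = tucker2_hosvd(W, rank_out=16, rank_in=8)

dense = W.size                                # 64*32*3*3 = 18432 values
compact = core.size + U_out.size + U_in.size  # 1152 + 1024 + 256 = 2432 values
rel_err = np.linalg.norm(tucker2_reconstruct(core, U_out, U_in) - W) / np.linalg.norm(W)
print(round(dense / compact, 2), round(rel_err, 3))
```

For this illustrative 64×32×3×3 layer with ranks (16, 8), the core tensor plus the two factor matrices hold 2,432 values instead of 18,432, roughly a 7.6× reduction before any quantization. How much accuracy survives depends on the real pre-trained weights and on fine-tuning, both of which the sketch omits.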

What problem does this paper attempt to address?

The paper aims to compress deep convolutional neural networks (DCNNs) aggressively while maintaining object recognition accuracy. Specifically, it proposes Quantized Low-Rank Tensor Decomposition (QLTD), which combines low-rank tensor decomposition, quantization, and a self-attention mechanism to address the sharp performance drop that existing network compression methods suffer at extremely high compression ratios. With this method, the authors aim to deploy more efficient deep-learning models where storage space is limited and computing resources are insufficient, such as on mobile phones and small robots.

### Specific problems addressed by the paper

1. **Large number of parameters**: Existing deep convolutional neural networks have many parameters, which leads to high storage requirements and high computational complexity and limits their use on resource-constrained devices.
2. **Limitations of existing compression methods**: Although many network compression methods have been proposed, such as pruning, low-rank decomposition, and quantization, they usually struggle to maintain network performance at high compression ratios. Even when a high compression ratio is achieved, the recognition performance of the network drops significantly.
3. **Balancing compression and performance**: The proposed method aims to reduce the parameter count substantially while minimizing the impact on recognition accuracy, striking a good balance between compression and performance.

### Main contributions

1. **Proposing the QLTD method**: By combining low-rank tensor decomposition, quantization, and a self-attention mechanism, QLTD achieves an extremely high parameter compression ratio (up to 200×), and when the compression ratio is below 100×, the loss of recognition accuracy is minimal.
2. **Unifying multiple compression techniques**: The QLTD framework integrates the advantages of several compression methods, such as network pruning, low-rank decomposition, and quantization. It sequentially performs Tucker decomposition, permutation, and quantization to compress the network more efficiently.
3. **Introducing self-attention modules**: A self-attention module is attached to each core tensor and factor matrix to focus on key positions and reduce the performance loss caused by permutation and quantization.

### Experimental results

- **CIFAR-10 experiment**: QLTD achieved a compression ratio of 35.43× on the CIFAR-10 dataset with an accuracy loss of less than 1%, and a compression ratio of 90.61× with an accuracy loss of less than 2%.
- **CIFAR-100 and ImageNet experiments**: QLTD also maintains relatively high recognition accuracy at high compression ratios.

Through these contributions, the paper provides an effective method that enables deep convolutional neural networks to operate efficiently in resource-constrained environments without significantly sacrificing recognition performance.
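The quantization stage of the Tucker decomposition, permutation, and quantization pipeline can likewise be sketched. Each row of a factor matrix is split into short sub-vectors, the sub-vectors are clustered with k-means, and only the codebook plus one index per sub-vector is stored. This is a minimal sketch under stated assumptions: plain Lloyd's k-means, the sub-vector length `sub_dim`, the codebook size `k`, and 32-bit floats for the codebook are illustrative choices, and the helper names are hypothetical. It omits the permutation step and the self-attention modules, so it is not the paper's QLTD procedure.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (codebook, codes)."""
    rng = np.random.default_rng(seed)
    # Initialize the codebook with k distinct sub-vectors.
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every sub-vector to every codeword.
        dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        codes = dists.argmin(axis=1)
        for j in range(k):
            members = x[codes == j]
            if len(members):  # keep the old codeword if a cluster is empty
                codebook[j] = members.mean(axis=0)
    # Final assignment against the last codebook update.
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook, dists.argmin(axis=1)


def vector_quantize(matrix, sub_dim, k, seed=0):
    """Split each row into sub-vectors of length sub_dim and cluster them."""
    if matrix.shape[1] % sub_dim:
        raise ValueError("row length must be divisible by sub_dim")
    subvecs = matrix.reshape(-1, sub_dim)  # row-major: chunks never cross rows
    return kmeans(subvecs, k, seed=seed)


def dequantize(codebook, codes, shape):
    """Replace each index by its codeword and restore the matrix shape."""
    return codebook[codes].reshape(shape)


def storage_ratio(shape, sub_dim, k, float_bits=32):
    """Dense float bits divided by codebook bits plus index bits."""
    n_values = shape[0] * shape[1]
    index_bits = int(np.ceil(np.log2(k)))
    dense = n_values * float_bits
    compact = k * sub_dim * float_bits + (n_values // sub_dim) * index_bits
    return dense / compact


# Illustrative 64 x 16 factor matrix (random stand-in for a Tucker factor).
rng = np.random.default_rng(0)
U = rng.standard_normal((64, 16))
codebook, codes = vector_quantize(U, sub_dim=4, k=32)
U_hat = dequantize(codebook, codes, U.shape)
ratio = storage_ratio(U.shape, sub_dim=4, k=32)
print(codebook.shape, codes.shape, round(ratio, 2))  # prints (32, 4) (256,) 6.1
```

With 4-dimensional sub-vectors and 32 codewords, the 64×16 factor matrix drops from 32,768 bits to 5,376 bits (about 6.1×). Stacking this on top of the Tucker factors compounds the two reductions; note that this sketch counts storage bits, whereas the paper's exact definition of its compression ratio is not given here.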

Towards Super Compressed Neural Networks for Object Identification: Quantized Low-Rank Tensor Decomposition with Self-Attention

A Compression Pipeline for One-Stage Object Detection Model

An Efficient Compressive Convolutional Network for Unified Object Detection and Image Compression

Compressing Deep Convolutional Networks using Vector Quantization

QTTNet: Quantized Tensor Train Neural Networks for 3D Object and Video Recognition

Focused Quantization for Sparse CNNs

Towards Efficient Network Compression Via Few-Shot Slimming

Deep neural network compression by Tucker decomposition with nonlinear response

Picking Up Quantization Steps for Compressed Image Classification

Unsupervised Network Quantization via Fixed-Point Factorization

Deep Neural Network Compression With Single and Multiple Level Quantization

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

CNN Acceleration by Low-rank Approximation with Quantized Factors

Compression for Text Detection and Recognition Based on Low Bit-Width Quantization

CLIP-Q: Deep Network Compression Learning by In-parallel Pruning-Quantization

Learning Low Resource Consumption CNN through Pruning and Quantization

Compression of Deep Neural Networks for Image Instance Retrieval

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Efficient Neural Network Compression Inspired by Compressive Sensing

Lightweight compression of neural network feature tensors for collaborative intelligence