Abstract: This paper explores the combination of neural network quantization and entropy coding for memory footprint minimization. Edge deployment of quantized models is hampered by the harsh Pareto frontier of the accuracy-to-bitwidth tradeoff, which causes dramatic accuracy loss below a certain bitwidth. This accuracy loss can be alleviated by mixed-precision quantization, which allows for more flexible bitwidth allocation. However, the benefits of standard mixed precision remain limited by the 1-bit frontier, which forces each parameter to be encoded on at least 1 bit of data. This paper introduces an approach that combines mixed precision, zero-point quantization and entropy coding to push the compression boundary of ResNets beyond the 1-bit frontier with an accuracy drop below 1% on the ImageNet benchmark. From an implementation standpoint, a compact decoder architecture features reduced latency, thus allowing for inference-compatible decoding.
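To see why entropy coding can in principle go below the 1-bit frontier, it helps to compute the Shannon entropy of a quantized weight distribution. The sketch below is an illustration only: the ternary alphabet and the probabilities are hypothetical values chosen for the example, not figures from the paper.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over ternary weights {-1, 0, +1}.
# The probabilities are illustrative assumptions, not values from the paper.
peaked_ternary = [0.05, 0.90, 0.05]   # most weights quantize to zero
uniform_ternary = [1 / 3, 1 / 3, 1 / 3]

h_peaked = entropy_bits(peaked_ternary)
h_uniform = entropy_bits(uniform_ternary)

print(f"peaked ternary : {h_peaked:.3f} bits/weight")
print(f"uniform ternary: {h_uniform:.3f} bits/weight")
```

Under these assumed probabilities the peaked distribution carries roughly 0.57 bits per weight, so an ideal entropy coder could store it in less than 1 bit per parameter, whereas a fixed-width encoding of three levels needs 2 bits.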

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to minimize the memory footprint of a model while maintaining high inference accuracy when deploying neural networks on edge devices, by combining quantization and entropy coding. Specifically, the paper explores how to combine mixed-precision quantization, zero-point quantization and entropy coding to push the compression rate of ResNet models to the sub-bit level while keeping the accuracy loss within 1% on the ImageNet benchmark.

### Main Problems and Challenges

1. **Trade-off between Accuracy and Bit-width**: Traditional quantization methods significantly reduce the inference accuracy of the model while reducing its size, especially when the bit-width falls below a certain threshold.
2. **1-bit Boundary**: Standard mixed-precision quantization methods are limited by the 1-bit boundary, that is, each parameter needs at least 1 bit of data to represent.
3. **Application of Entropy Coding in Inference**: Although entropy coding can further compress data, it is usually difficult to apply during inference because of its high decoding complexity.

### Solutions

1. **Mixed-precision Quantization**: Allow different layers or channels to use different quantization bit-widths, so that bit-widths can be allocated more flexibly.
2. **Zero-point Quantization**: Introduce zero-point quantization to make the quantized weight distribution more symmetrical, thereby reducing the complexity of entropy coding and improving compression efficiency.
3. **Entropy Coding**: Use Asymmetric Numeral Systems (ANS) for entropy coding, in particular to achieve low-latency decoding during inference.

### Experimental Results

- **ResNet-18**: On the ImageNet dataset, the accuracy of the model decreased by only 0.8%, and the compression rate reached 2.7 times.
- **ResNet-50**: Also on the ImageNet dataset, the accuracy decreased by 0.5% and the compression rate reached 2.3 times.

### Conclusion

By combining mixed-precision quantization, zero-point quantization and entropy coding, the memory footprint of neural network models can be significantly reduced while maintaining high inference accuracy, making them more suitable for deployment on resource-limited edge devices.
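The summary names Asymmetric Numeral Systems (ANS) as the entropy coder. As a rough illustration of the idea, and not the paper's decoder architecture, here is a minimal range-ANS (rANS) sketch. It assumes a hypothetical ternary weight stream with mostly zero values, and it keeps the whole coder state in one arbitrary-precision Python integer instead of the fixed-width, renormalized state that a practical low-latency decoder would use. All function names are invented for this example.

```python
import random
from collections import Counter

def build_table(symbols, total=1 << 12):
    """Quantize empirical counts to integer frequencies that sum to `total`."""
    counts = Counter(symbols)
    alphabet = sorted(counts)
    n = len(symbols)
    freqs = {s: max(1, counts[s] * total // n) for s in alphabet}
    # Absorb the rounding error into the most frequent symbol.
    top = max(alphabet, key=lambda s: freqs[s])
    freqs[top] += total - sum(freqs.values())
    assert freqs[top] > 0
    cum, acc = {}, 0
    for s in alphabet:
        cum[s] = acc          # start of this symbol's slot range
        acc += freqs[s]
    return freqs, cum, total

def rans_encode(symbols, freqs, cum, total):
    """Encode into a single big integer (no renormalization, for clarity)."""
    x = 1
    for s in reversed(symbols):   # reverse order so decoding runs forward
        f = freqs[s]
        x = (x // f) * total + cum[s] + (x % f)
    return x

def rans_decode(x, n, freqs, cum, total):
    """Recover n symbols from the integer state produced by rans_encode."""
    lookup = [None] * total       # slot -> symbol table
    for s, f in freqs.items():
        for slot in range(cum[s], cum[s] + f):
            lookup[slot] = s
    out = []
    for _ in range(n):
        slot = x % total
        s = lookup[slot]
        x = freqs[s] * (x // total) + slot - cum[s]
        out.append(s)
    return out

# Hypothetical ternary weights; the probabilities are illustrative assumptions.
random.seed(0)
weights = random.choices([-1, 0, 1], weights=[0.05, 0.90, 0.05], k=10_000)

freqs, cum, total = build_table(weights)
state = rans_encode(weights, freqs, cum, total)
decoded = rans_decode(state, len(weights), freqs, cum, total)
bits_per_weight = state.bit_length() / len(weights)

print("lossless round trip:", decoded == weights)
print(f"bits per weight: {bits_per_weight:.3f}")
```

Each decoding step costs one table lookup, one multiplication and a few additions, which is part of what makes ANS attractive for decoding at inference time. The sketch omits the frequency table from the size accounting and does not stream bits, so the printed figure only indicates that a peaked weight distribution can be stored below 1 bit per parameter.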

Efficient Neural Compression with Inference-time Decoding
