Efficient Neural Compression with Inference-time Decoding

C. Metz, O. Bichler, A. Dupret
2024-06-10
Abstract: This paper explores the combination of neural network quantization and entropy coding for memory footprint minimization. Edge deployment of quantized models is hampered by the harsh Pareto frontier of the accuracy-to-bitwidth tradeoff, causing dramatic accuracy loss below a certain bitwidth. This accuracy loss can be alleviated thanks to mixed precision quantization, allowing for more flexible bitwidth allocation. However, standard mixed precision benefits remain limited due to the 1-bit frontier, which forces each parameter to be encoded on at least 1 bit of data. This paper introduces an approach that combines mixed precision, zero-point quantization and entropy coding to push the compression boundary of ResNets beyond the 1-bit frontier with an accuracy drop below 1% on the ImageNet benchmark. From an implementation standpoint, a compact decoder architecture features reduced latency, thus allowing for inference-compatible decoding.
Machine Learning
What problem does this paper attempt to address?
The paper addresses how to minimize a model's memory footprint while maintaining high inference accuracy when deploying neural networks on edge devices, by combining quantization with entropy coding. Specifically, it explores how mixed-precision quantization, zero-point quantization and entropy coding can together push the compression of ResNet models to the sub-bit level while keeping the accuracy loss within 1% on the ImageNet benchmark.

### Main Problems and Challenges

1. **Trade-off between Accuracy and Bit-width**: Traditional quantization methods shrink the model but significantly reduce its inference accuracy, especially when the bit-width falls below a certain threshold.
2. **1-bit Boundary**: Standard mixed-precision quantization methods are limited by the 1-bit boundary, that is, each parameter needs at least 1 bit of data to represent.
3. **Application of Entropy Coding in Inference**: Although entropy coding can compress the data further, its high complexity usually makes it difficult to apply during inference.

### Solutions

1. **Mixed-precision Quantization**: Allow different layers or channels to use different quantization bit-widths, so that bit-widths can be allocated more flexibly.
2. **Zero-point Quantization**: Introduce zero-point quantization to make the quantized weight distribution more symmetrical, thereby reducing the complexity of entropy coding and improving the compression efficiency.
3. **Entropy Coding**: Use Asymmetric Numeral Systems (ANS) for entropy coding, in particular to achieve low-latency decoding during inference.

### Experimental Results

- **ResNet-18**: On the ImageNet dataset, the accuracy of the model decreased by only 0.8%, and the compression rate reached 2.7 times.
- **ResNet-50**: Also on the ImageNet dataset, the accuracy decreased by 0.5% and the compression rate reached 2.3 times.

### Conclusion

By combining mixed-precision quantization, zero-point quantization and entropy coding, the memory footprint of neural network models can be significantly reduced while maintaining high inference accuracy, making them more suitable for deployment on resource-limited edge devices.
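To make the sub-bit idea concrete, the following minimal sketch shows how entropy coding can, in principle, bring the average cost per weight below 1 bit when the quantized values are heavily concentrated on a single symbol. It is not the paper's method: the ternary quantizer `ternary_quantize`, its `threshold` parameter, and the toy weight list are hypothetical choices made only for illustration, and the Shannon entropy is used as the lower bound that an entropy coder such as ANS can approach.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Shannon entropy (bits per symbol) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ternary_quantize(weights, threshold):
    """Hypothetical quantizer (an assumption, not the paper's scheme):
    map each weight to -1, 0 or +1 depending on its magnitude and sign."""
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]


# Toy weights (made up for illustration): 90% lie close to zero.
weights = [0.01] * 90 + [0.8] * 5 + [-0.8] * 5

quantized = ternary_quantize(weights, threshold=0.1)

# Fixed-length coding needs a whole number of bits for the 3 symbols.
fixed_bits = math.ceil(math.log2(len(set(quantized))))

# Entropy coding can approach the entropy of the symbol distribution.
entropy_bits = entropy_bits_per_symbol(quantized)

print(f"fixed-length cost : {fixed_bits} bits/weight")
print(f"entropy lower bound: {entropy_bits:.3f} bits/weight")
```

With this toy distribution, a fixed-length code spends 2 bits per weight, while the entropy bound is about 0.57 bits per weight. The gap is what an entropy coder can exploit, and it widens as the quantized distribution becomes more peaked.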