Downscaling and Overflow-aware Model Compression for Efficient Vision Processors

Haokun Li, Jing Liu, Liancheng Jia, Yun Liang, Yaowei Wang, Mingkui Tan
DOI: https://doi.org/10.1109/ICDCSW56584.2022.00036
2022-01-01
Abstract: Network pruning and quantization are two effective approaches to model compression. However, existing model compression methods seldom take hardware into consideration, so the compressed models still incur high energy and chip area costs on a vision processor. To address this issue, one may reduce the bit-widths of the accumulator and the multiplier used in fixed-point inference, which significantly reduces energy and chip area. However, the low-bit multiplier introduces a large numerical error in the downscaling procedure, while the low-bit accumulator suffers from overflow. Both lead to significant performance degradation. In this paper, we propose downscaling and overflow-aware model compression for efficient vision processors. Specifically, we propose downscaling-aware training, which simulates the downscaling procedure during training so that the models adapt to inference with low bit-width multipliers. To address the overflow issue, we apply overflow-aware training to gradually reduce the range of quantized values. We further use channel pruning to restrict the number of channels in each layer to a multiple of some value (e.g., 16) in order to take advantage of parallel computing. With the proposed method, we obtain compressed models that use low bit-width accumulators and multipliers during inference while maintaining performance. As a result, the energy and chip area costs can be significantly reduced. To demonstrate this, we further co-design an agilely customizable vision processor and its SoC. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the effectiveness of the proposed method. For example, on ImageNet, our compressed 8-bit ResNet-50 achieves lossless performance with 16-bit accumulators and 12-bit multipliers.
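The two failure modes named in the abstract, accumulator overflow and downscaling error from a low-bit multiplier, can be illustrated with a minimal sketch. This is not the authors' training method or hardware design; it only simulates fixed-point arithmetic under stated assumptions: a signed two's-complement accumulator that wraps on overflow, and a downscaling step that approximates a real-valued scale as an integer mantissa divided by a power of two. All function names and the example scale are hypothetical.

```python
def wrap_to_bits(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def dot_with_accumulator(weights, activations, acc_bits: int) -> int:
    """Integer dot product whose running sum wraps at acc_bits (assumed behavior)."""
    acc = 0
    for w, a in zip(weights, activations):
        acc = wrap_to_bits(acc + w * a, acc_bits)
    return acc


def quantize_multiplier(scale: float, mult_bits: int):
    """Approximate a scale in (0, 1) as m / 2**shift, with m < 2**mult_bits."""
    assert 0.0 < scale < 1.0 and mult_bits >= 2
    shift = 0
    # Grow the shift while the rounded mantissa still fits in mult_bits.
    while shift < 31 and round(scale * (1 << (shift + 1))) < (1 << mult_bits):
        shift += 1
    return round(scale * (1 << shift)), shift


def downscale(acc: int, m: int, shift: int) -> int:
    """Rescale an accumulator value with an integer multiply and a rounding shift."""
    return (acc * m + (1 << (shift - 1))) >> shift


# 300 products of 8-bit-range values exceed what a 16-bit accumulator can hold.
weights = [127] * 300
activations = [127] * 300
wide = dot_with_accumulator(weights, activations, 32)
narrow = dot_with_accumulator(weights, activations, 16)
print("32-bit accumulator:", wide)
print("16-bit accumulator:", narrow)

# A 12-bit mantissa versus a 4-bit mantissa for the same (hypothetical) scale.
for bits in (12, 4):
    m, shift = quantize_multiplier(0.0123, bits)
    print(bits, "bit multiplier:", downscale(10000, m, shift),
          "exact:", 10000 * 0.0123)
```

Narrowing the range of quantized values, as the overflow-aware training described above aims to do, keeps the running sum inside the accumulator's representable range; the sketch only shows what goes wrong when that range is exceeded.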