LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs
Yonghua Zhang, Hongxu Jiang, Yuting Zhu, Runhua Zhang, Yongxiang Cao, Chenhui Zhu, Wei Wang, Dong Dong, Xiaobin Li
DOI: https://doi.org/10.1007/s11227-023-05212-4
IF: 3.3
2023-04-06
The Journal of Supercomputing
Abstract: Channel pruning has recently become a widely used model compression method. However, most existing channel pruning methods prune only to reduce model size, measured by the number of parameters or FLOPs, and a smaller model does not reliably translate into better inference performance. To address this problem, this paper proposes a latency-optimized channel pruning method for CNN inference acceleration on GPU platforms that combines latency stair-step discrimination, two-stage benefit assessment, and latency-sharing channel pruning. Compared with recent state-of-the-art model compression methods, it achieves significant improvements in inference performance at comparable compression rates and model accuracy. The paper makes three contributions. First, a three-point latency stair-step discrimination method is proposed for determining the candidate prunable coordinates with the best latency performance on the current hardware. Second, a two-stage benefit assessment method based on interlayer dependencies is proposed for determining the optimal channel pruning rate of each layer in the network. Third, a latency-sharing channel pruning framework is proposed to accelerate the model pruning adaptation process. The proposed method significantly reduces model inference latency on multiple types of GPU platforms. To verify its effectiveness, the algorithm is evaluated on three general-purpose GPU platforms and two embedded GPU platforms. The experimental results show that, for recent state-of-the-art CNNs, the proposed method achieves a 22.0–6.6% latency reduction, a 1.3–3.0× inference performance improvement, and a 1.2–4.3× pruning adaptation speedup while maintaining high model accuracy.
computer science, theory & methods; engineering, electrical & electronic; hardware & architecture
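The abstract's notion of a latency "stair-step" can be illustrated with a small sketch. On GPUs, layer latency often stays flat across a range of channel counts and then jumps, so the useful pruning targets are the channel counts at the right edge of each plateau. The code below is not the paper's three-point discrimination method, whose details the abstract does not give; it is a simpler adjacent-difference threshold. The function name `stair_step_candidates`, the `tolerance` parameter, and the synthetic latency table are all assumptions made for illustration.

```python
import math


def stair_step_candidates(latency_ms, tolerance=0.05):
    """Return channel counts sitting at the right edge of a latency plateau.

    latency_ms: dict mapping channel count -> measured latency in ms
                (hypothetical measurements, not data from the paper).
    tolerance:  relative latency jump that counts as a new step
                (hypothetical parameter).
    """
    channels = sorted(latency_ms)
    candidates = []
    for current, following in zip(channels, channels[1:]):
        jump = latency_ms[following] - latency_ms[current]
        # A jump larger than the tolerance means `current` is the last
        # channel count before latency steps up.
        if jump > tolerance * latency_ms[current]:
            candidates.append(current)
    # The unpruned width is always a valid candidate.
    candidates.append(channels[-1])
    return candidates


# Synthetic curve: latency steps up once every 32 channels (an assumption
# standing in for real profiling on a target GPU).
measured = {c: 1.0 * math.ceil(c / 32) for c in range(8, 129, 8)}
print(stair_step_candidates(measured))  # [32, 64, 96, 128]
```

In practice the latency table would come from profiling each candidate layer width on the target device, and measurement noise would call for averaging repeated runs before applying any threshold.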