Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment

Yuyang Jin,Runxin Zhong,Saiqin Long,Jidong Zhai
DOI: https://doi.org/10.1109/tpds.2024.3462092
IF: 5.3
2024-10-05
IEEE Transactions on Parallel and Distributed Systems
Abstract:Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.
computer science, theory & methods,engineering, electrical & electronic
What problem does this paper attempt to address?