Exploiting Sparsity to Accelerate Fully Connected Layers of CNN-Based Applications on Mobile SoCs

Xinfeng Xie, Dayou Du, Qian Li, Yun Liang, Wai Teng Tang, Zhong Liang Ong, Mian Lu, Huynh Phung Huynh, Rick Siow Mong Goh
DOI: https://doi.org/10.1145/3122788
2017-01-01
ACM Transactions on Embedded Computing Systems
Abstract: Convolutional neural networks (CNNs) are widely employed in many image recognition applications. With the proliferation of embedded and mobile devices, such applications are becoming commonplace on mobile devices. Network pruning is a commonly used strategy to reduce the memory and storage footprints of CNNs on mobile devices. In this article, we propose customized versions of the sparse matrix multiplication algorithm to speed up inference on mobile devices and make it more energy efficient. Specifically, we propose a Block Compressed Sparse Column algorithm and a bit-representation-based algorithm (BitsGEMM) that exploit sparsity to accelerate the fully connected layers of a network on the NVIDIA Jetson TK1 platform. We evaluate the proposed algorithms using real-world object classification and object detection applications. Experiments show that performance speedups can be achieved over the original baseline implementation using cuBLAS. On object detection CNNs, an average speedup of 1.82× is obtained over baseline cuBLAS in the fully connected layer of the VGG model, whereas on classification CNNs, an average speedup of 1.51× is achieved for the fully connected layer of the pruned-VGG model. Energy consumption reduction of 43–46% is also observed due to decreased computational and memory bandwidth demands.
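As a rough illustration of why pruning can reduce work in a fully connected layer, the following minimal sketch stores a pruned weight matrix in plain Compressed Sparse Column (CSC) form and multiplies it by an activation vector, visiting only the stored nonzero weights. It is not the Block Compressed Sparse Column or BitsGEMM kernel described in the abstract, and it runs on the CPU rather than on the Jetson TK1 GPU. The function names `dense_to_csc` and `csc_matvec` and the small example matrix are assumptions introduced here for illustration.

```python
# Illustrative only: plain CSC on the CPU, not the paper's Block CSC or BitsGEMM kernels.

def dense_to_csc(weights):
    """Convert a dense matrix (list of rows) into CSC arrays.

    Returns (values, row_idx, col_ptr, n_rows), where column j's nonzeros
    are values[col_ptr[j]:col_ptr[j + 1]] with row indices in row_idx.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if n_rows else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            w = weights[i][j]
            if w != 0:
                values.append(w)
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr, n_rows


def csc_matvec(values, row_idx, col_ptr, n_rows, x):
    """Compute y = W x, touching only the stored nonzero weights."""
    y = [0.0] * n_rows
    for j, xj in enumerate(x):
        if xj == 0:
            continue  # a zero activation contributes nothing to y
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * xj
    return y


# Hypothetical pruned 3x4 weight matrix: 4 of 12 entries survive pruning.
W = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 4.0],
]
x = [1.0, 2.0, 3.0, 4.0]

values, row_idx, col_ptr, n_rows = dense_to_csc(W)
y = csc_matvec(values, row_idx, col_ptr, n_rows, x)
```

In this sketch the inner loop runs once per stored nonzero, so the multiply-add count scales with the number of surviving weights rather than with the full matrix size. That is the general source of the reduced computation and memory traffic the abstract refers to, although the speedups reported there come from the authors' GPU-specific algorithms.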