PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning
Hanfei Geng, Yifei Liu, Yujie Zheng, Li Lyna Zhang, Jingwei Sun, Yujing Wang, Yang Wang, Guangzhong Sun, Mao Yang, Ting Cao, Yunxin Liu
DOI: https://doi.org/10.1109/tc.2024.3441855
IF: 3.183
2024-10-12
IEEE Transactions on Computers
Abstract: Although pruning is an effective technique for reducing the number of weights in deep neural networks (DNNs), it remains challenging for the resulting sparse networks to perform low-latency inference on everyday hardware. This problem is mainly caused by a mismatch between the unstructured sparsity adopted to preserve accuracy and the regular sparse patterns expected by the sparse platform (the combination of a sparse kernel library and the underlying hardware). To resolve this conflict, we propose PruneAug, an augmentation of existing unstructured pruning methods that finds block-sparse networks with much lower latency while preserving accuracy. The fundamental idea of PruneAug is to prune the network with a layerwise block dimension assignment in a platform-aware fashion. Subject to an accuracy-loss constraint, PruneAug minimizes the latency of the block-sparse network by jointly optimizing this layerwise block dimension assignment and the network's sparsity level. This joint optimization enlarges the solution space; to curb the search cost, we incorporate multiple optimizations into the design of PruneAug's search space and search strategy. Our evaluation over diverse pruning methods, DNNs, datasets, and sparse platforms shows that PruneAug enables different pruning methods to achieve speedups of up to ∼13×, depending on the platform, while maintaining competitive accuracy relative to unstructured sparsity, extracting the full potential of sparse platforms.
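The constrained optimization described in the abstract can be illustrated with a minimal sketch. This is not PruneAug's implementation: the block scoring rule (sum of absolute weights), the brute-force enumeration, and the two proxy functions `toy_latency` and `toy_accuracy_loss` are all hypothetical stand-ins chosen for illustration, whereas the abstract indicates that the actual method measures behavior in a platform-aware way and uses search optimizations precisely to avoid exhaustive enumeration.

```python
from itertools import product


def block_prune(weights, block, sparsity):
    """Zero out the lowest-magnitude blocks of a 2-D weight matrix.

    weights: list of rows (lists of floats); block: (bh, bw); sparsity in [0, 1].
    Assumes the matrix dimensions are divisible by the block dimensions.
    """
    rows, cols = len(weights), len(weights[0])
    bh, bw = block
    assert rows % bh == 0 and cols % bw == 0
    # Score each block by the sum of absolute weights it contains (assumed rule).
    scores = []
    for r0, c0 in product(range(0, rows, bh), range(0, cols, bw)):
        s = sum(abs(weights[r][c])
                for r in range(r0, r0 + bh) for c in range(c0, c0 + bw))
        scores.append((s, r0, c0))
    scores.sort()
    pruned = [row[:] for row in weights]
    for _, r0, c0 in scores[:int(len(scores) * sparsity)]:
        for r in range(r0, r0 + bh):
            for c in range(c0, c0 + bw):
                pruned[r][c] = 0.0
    return pruned


def toy_latency(assignment, sparsity):
    # Hypothetical proxy: larger blocks and higher sparsity run faster.
    return sum((1.0 - sparsity) / (bh * bw) for bh, bw in assignment)


def toy_accuracy_loss(layers, pruned_layers):
    # Hypothetical proxy: fraction of total weight magnitude removed.
    total = sum(abs(w) for layer in layers for row in layer for w in row)
    kept = sum(abs(w) for layer in pruned_layers for row in layer for w in row)
    return (total - kept) / total


def search(layers, block_choices, sparsity_choices, max_loss):
    """Brute-force search: minimize latency subject to loss <= max_loss."""
    best = None
    for sparsity in sparsity_choices:
        # One block dimension per layer: the layerwise assignment.
        for assignment in product(block_choices, repeat=len(layers)):
            pruned = [block_prune(w, b, sparsity)
                      for w, b in zip(layers, assignment)]
            if toy_accuracy_loss(layers, pruned) > max_loss:
                continue
            latency = toy_latency(assignment, sparsity)
            if best is None or latency < best[0]:
                best = (latency, assignment, sparsity)
    return best


W = [[1.0, 1.0, 0.1, 0.1],
     [1.0, 1.0, 0.1, 0.1],
     [0.2, 0.2, 2.0, 2.0],
     [0.2, 0.2, 2.0, 2.0]]
result = search([W], [(1, 1), (2, 2)], [0.0, 0.5], max_loss=0.1)
```

The enumeration grows exponentially with the number of layers, which is the expanded solution space the abstract refers to; in practice the proxies would be replaced by latency measured on the target sparse platform and accuracy measured on a validation set.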
engineering, electrical & electronic; computer science, hardware & architecture