DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures

Chen Zhang, Yang Wang, Zhiqiang Xie, Cong Guo, Yunxin Liu, Jingwen Leng, Guangyu Sun, Zhigang Ji, Runsheng Wang, Yuan Xie, Ru Huang
DOI: https://doi.org/10.1109/tc.2024.3475814
IF: 3.183
2024-01-01
IEEE Transactions on Computers
Abstract: Leveraging sparsity in deep neural network (DNN) models holds significant promise for accelerating model inference. However, current GPUs can only harness sparsity in model weights, leaving activation sparsity unexploited because its dynamic and unpredictable nature makes it considerably harder to use. In our research, we introduce a novel architectural approach aimed at effectively leveraging dual-side sparsity, encompassing both weight and activation sparsity. Our methodology involves a systematic examination of previous sparsity-related architectures, culminating in the proposal of a previously unexplored paradigm that combines an outer-product computation primitive with a bitmap-based encoding format. Our approach is feasible with minimal modifications to existing production-scale inner-product-based Tensor Cores. We introduce a set of ISA extensions and carefully co-design matrix-matrix multiplication and convolution algorithms, the two predominant computation patterns in contemporary DNN models, to exploit our dual-side sparse Tensor Core. Our evaluation demonstrates the efficacy of our design, unlocking the full potential of dual-side DNN sparsity and delivering performance enhancements of up to an order of magnitude while incurring only modest hardware overhead.
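To make the pairing of outer-product computation with bitmap encoding concrete, the following is a minimal software sketch and not the paper's hardware design or ISA. The function names `bitmap_encode` and `outer_product_spgemm` are hypothetical, chosen only for this illustration. It relies on one general property of outer products: for each index k, every nonzero in column k of A pairs with every nonzero in row k of B, so zeros on both sides can be skipped without matching indices between the two operands.

```python
def bitmap_encode(vec):
    """Encode a vector as (bitmap, values).

    bitmap[i] is 1 where vec[i] is nonzero; values holds the
    nonzero entries in their original order. (Illustrative format,
    assumed here; the paper's exact layout is not given in the abstract.)
    """
    bitmap = [1 if x != 0 else 0 for x in vec]
    values = [x for x in vec if x != 0]
    return bitmap, values


def outer_product_spgemm(A, B):
    """Compute C = A @ B as a sum of outer products, skipping zeros.

    For each k, column k of A and row k of B are bitmap-encoded, and
    only nonzero-by-nonzero products are accumulated. Returns the
    result matrix and the number of multiply-accumulates performed.
    """
    m, k_dim, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    macs = 0
    for k in range(k_dim):
        a_bitmap, a_vals = bitmap_encode([A[i][k] for i in range(m)])
        b_bitmap, b_vals = bitmap_encode(B[k])
        # Positions of the nonzeros, recovered from the bitmaps.
        rows = [i for i, bit in enumerate(a_bitmap) if bit]
        cols = [j for j, bit in enumerate(b_bitmap) if bit]
        for i, a in zip(rows, a_vals):
            for j, b in zip(cols, b_vals):
                C[i][j] += a * b
                macs += 1
    return C, macs


# Small example: A stands in for sparse activations, B for sparse weights.
A = [[1, 0, 2],
     [0, 0, 3]]
B = [[0, 4],
     [5, 0],
     [6, 0]]
C, macs = outer_product_spgemm(A, B)
print(C, macs)
```

In this toy case the dense product would need 2 x 3 x 2 = 12 multiply-accumulates, while the sketch performs only the ones where both operands are nonzero.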