A 400MHz NPU with 7.8TOPS²/W High-Performance-Guaranteed Efficiency in 55nm for Multi-Mode Pruning and Diverse Quantization Using Pattern-Kernel Encoding and Reconfigurable MAC Units

Zhanhong Tan, Sia-Huat Tan, Jan-Henrik Lambrechts, Yannian Zhang, Yifu Wu, Kaisheng Ma
DOI: https://doi.org/10.1109/cicc51472.2021.9431519
2021-01-01
Abstract: Deep neural networks show promise in applications ranging from face ID on mobile phones to self-driving cars. Weight pruning and quantization are valuable techniques for relieving the burden of computation and memory. Figure 1 shows the family of weight-pruning methods, including fine-grained pruning and several structural pruning methods. At similar compression rates, coarse-grained pruning results in a larger accuracy drop. A newer structural solution called pattern pruning [5] achieves excellent precision with uniform sparsity rates among kernels, which makes it hardware-friendly. Kernels are encoded as their non-zero values together with sparse pattern masks (SPM). This work adopts 16 types of patterns with a 4-bit SPM for the 3x3 convolution, which gains up to 8x compression for eight-zero kernels. As for quantization, the optimal choice generally depends on the model.
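The pattern-kernel encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the 16 patterns, so the pattern library below (the centre weight plus three other positions) is a hypothetical stand-in, and the names PATTERNS, encode, and decode are illustrative only. The sketch shows a 3x3 kernel being stored as a 4-bit pattern index (the SPM) plus the weights at the positions that pattern keeps.

```python
from itertools import combinations

KERNEL_SIZE = 9  # a 3x3 kernel flattened in row-major order

# Hypothetical library of 16 patterns (an assumption, not the paper's set):
# each pattern keeps the centre weight (index 4) plus three other positions.
PATTERNS = [
    tuple(sorted((4,) + extra))
    for extra in combinations([0, 1, 2, 3, 5, 6, 7, 8], 3)
][:16]


def encode(kernel):
    """Return (spm, values): a pattern index below 16 and the kept weights."""
    nonzero = {i for i, w in enumerate(kernel) if w != 0}
    for spm, pattern in enumerate(PATTERNS):
        # A kernel fits a pattern when all its non-zeros lie inside the mask.
        if nonzero <= set(pattern):
            return spm, [kernel[i] for i in pattern]
    raise ValueError("kernel does not match any pattern in the library")


def decode(spm, values):
    """Rebuild the flattened 3x3 kernel from its pattern index and weights."""
    kernel = [0] * KERNEL_SIZE
    for position, weight in zip(PATTERNS[spm], values):
        kernel[position] = weight
    return kernel


kernel = [3, -1, 2,
          0,  5, 0,
          0,  0, 0]
spm, values = encode(kernel)
restored = decode(spm, values)
```

With this stand-in library, only four weights and one 4-bit index are stored per kernel instead of nine weights; the actual savings in the paper depend on its own pattern set and weight bit widths, which this sketch does not model.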