UCViT: Hardware-Friendly Vision Transformer via Unified Compression

HongRui Song, Ya Wang, Meiqi Wang, Zhongfeng Wang
DOI: https://doi.org/10.1109/ISCAS48785.2022.9937660
2022-01-01
Abstract: Vision Transformer (ViT) has emerged as a powerful model with extraordinary performance on multiple computer vision applications. However, the huge model size and the enormous energy consumption incurred by dense matrix multiplications make ViT hard to deploy on edge devices. To tackle these challenges, we develop a unified compression framework for Vision Transformer (UCViT), which compresses the original ViT model by combining low bit-width quantization with dense matrix decomposition. To minimize energy expenditure, we propose a dedicated design based on aggressive quantization, in which the majority of the matrix multiplications are converted to hardware-friendly shift and addition operations. In addition, we incorporate a small module into the quantized model by exploiting a unique characteristic of multi-head attention during matrix decomposition; this module recovers a significant share of the accuracy lost in the deeply compressed model with minimal impact on energy efficiency. Benefiting from the effective fusion of different compression techniques and the hardware-friendly operations, the proposed model reduces inference energy consumption by up to 98% compared to the original ViT model. Experiments on the CIFAR-10 and CIFAR-100 image classification tasks show that the proposed model obtains a highly compact structure with a competitive compression ratio (up to 6.7x) while incurring only a small accuracy loss (less than 1%).
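To make the shift-and-add idea concrete, the following is a minimal sketch of one common way to replace multiplications with shifts: rounding each weight to a signed power of two. The abstract does not specify UCViT's actual quantizer, so this scheme, the function names (`quantize_pow2`, `shift_mul`, `shift_add_dot`), the exponent range, and the fixed-point width are all assumptions chosen for illustration.

```python
import math

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round a real weight to the nearest signed power of two (in the log domain).

    Returns (sign, exponent); sign == 0 encodes a zero weight.
    The exponent range is an illustrative assumption, not taken from the paper.
    """
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return sign, exp

def shift_mul(x, sign, exp, frac_bits=8):
    """Multiply integer activation x by sign * 2**exp using only a shift.

    The result is a fixed-point integer with frac_bits fractional bits.
    Assumes frac_bits >= -min_exp so the shift amount is never negative.
    """
    if sign == 0:
        return 0
    return sign * (x << (exp + frac_bits))

def shift_add_dot(xs, ws, frac_bits=8):
    """Dot product of integer activations and real weights via shifts and adds."""
    acc = 0
    for x, w in zip(xs, ws):
        sign, exp = quantize_pow2(w)
        acc += shift_mul(x, sign, exp, frac_bits)
    return acc / (1 << frac_bits)  # convert fixed point back to a real value

# Weights that are already powers of two are reproduced exactly.
print(shift_add_dot([3, -2, 5], [0.5, 0.25, -1.0]))
```

In hardware, the final division by `1 << frac_bits` would typically be a fixed right shift or be folded into a later scaling step; it appears here only to make the result readable. Weights that are not powers of two incur a rounding error, which is the kind of accuracy loss a recovery mechanism would need to compensate for.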