HSViT: A Hardware and Software Collaborative Design for Vision Transformer Via Multi-level Compression

HongRui Song, Liang Xu, Ya Wang, Xiao Wu, Meiqi Wang, Zhongfeng Wang
DOI: https://doi.org/10.1109/iscas58744.2024.10557837
2024-01-01
Abstract: The rapid advancement of Vision Transformer (ViT) models has greatly enhanced performance in computer vision tasks. However, deploying ViTs in resource-constrained environments is challenging because attention computation forms a bottleneck that demands extensive memory and computation resources. To address this issue, we propose HSViT, a dedicated hardware and software co-design framework designed specifically for ViT. HSViT introduces a configurable and efficient accelerator with dedicated dataflows that take advantage of multi-level compression, including feature map compression, token pruning, and hardware-friendly sparsity. The proposed accelerator reduces intermediate transmission for feature maps and for the Query, Key, and Value matrices while enhancing data reuse and processing element (PE) utilization for chain matrix multiplications. Moreover, an innovative Top-k engine, integrated into the accelerator, is presented to support various selection scenarios with high speed and low resource consumption. Experiments validate that the proposed HSViT delivers significant speedups of 123.91x, 29.5x, and 3.01x to 20.65x over conventional CPUs, GPUs, and prior work, respectively. HSViT also achieves a throughput of up to 731.5 GOP/s and PE utilization as high as 92%.
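The abstract does not say how tokens are scored or how the hardware Top-k engine is built, so the following is only a minimal software sketch of the general idea behind Top-k token pruning, not the HSViT implementation. It assumes each token already has an importance score (for example, derived from attention weights, which is an assumption here) and keeps the k highest-scoring tokens in their original order. The names `top_k_indices` and `prune_tokens` are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, in ascending index order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    # Rank indices by descending score; sorted() is stable, so ties
    # are resolved in favor of the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore the original token order among the survivors.
    return sorted(ranked[:k])


def prune_tokens(tokens: Sequence[T], scores: Sequence[float], k: int) -> List[T]:
    """Keep only the k tokens with the highest (assumed) importance scores."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [tokens[i] for i in top_k_indices(scores, k)]


if __name__ == "__main__":
    # Illustrative values only; these are not data from the paper.
    tokens = ["t0", "t1", "t2", "t3", "t4"]
    scores = [0.10, 0.40, 0.05, 0.30, 0.15]
    print(prune_tokens(tokens, scores, k=3))
```

Sorting costs O(n log n) and is used here only for clarity; a dedicated hardware engine would presumably use a different selection structure, which the abstract leaves unspecified.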