Balance Multi-Head Attention Based on Software and Hardware Co-design

Dian Xu, Wei Hu, Fang Liu, Zimeng Fan, Qingsong Shi
DOI: https://doi.org/10.1109/cscloud-edgecom54986.2022.00018
2022-01-01
Abstract: Recently, Transformer-based models have achieved leading results in many research areas, such as natural language processing and computer vision. However, because Transformer-based models have high computational complexity, optimizing the Transformer is a focus of current research. The core of Transformer models is the attention module. In this paper, we propose a software and hardware co-designed attention module that reduces inference time by balancing the computational units of the multi-head attention module. We also design the corresponding structure on an FPGA, which speeds up inference by 14.6x compared with the original attention module implemented on a GPU. We applied this attention module to multiple visual Transformer models and measured the accuracy difference between the balanced multi-head attention model and the original model on CIFAR-10. The balanced model reached 80.46% accuracy, 0.18% lower than the original version.
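For readers unfamiliar with the module being optimized, the following is a minimal sketch of standard (unbalanced) multi-head attention, in which each head is an independent computational unit. It is not the balanced design proposed in the paper, whose details the abstract does not give. The function names are illustrative, and the learned input and output projections of a real Transformer layer are omitted as a simplifying assumption.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v):
    """Scaled dot-product attention for one head: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def split_heads(x, num_heads):
    """Split the feature axis of x (tokens x dim) into num_heads equal slices."""
    dim = len(x[0])
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    head_dim = dim // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    """Run every head independently, then concatenate along the feature axis.

    Assumption: no learned W_Q, W_K, W_V, or W_O projections are applied.
    """
    heads = [
        attention_head(qh, kh, vh)
        for qh, kh, vh in zip(split_heads(q, num_heads),
                              split_heads(k, num_heads),
                              split_heads(v, num_heads))
    ]
    return [[value for head in heads for value in head[t]]
            for t in range(len(q))]
```

Because the heads do not depend on one another, their work can be distributed across parallel hardware units, which is the kind of structure a co-designed accelerator can exploit.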