ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm

Jingyu Wang, Lu Zhang, Xueqing Li, Huazhong Yang, Yongpan Liu
DOI: https://doi.org/10.1109/tcad.2023.3329039
IF: 2.9
2023-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: Transformer networks have been increasingly successful in various fields. Input sequence lengths have grown much larger as algorithm and task complexity increase, which is challenging because of the high computational and storage cost. Softmax and LayerNorm are the bottleneck nonlinear operators in ultra-long-sequence Transformer networks. To improve the efficiency of Softmax, assumption-based and quantization-based approaches have been introduced; however, the potential of sparsity to accelerate Softmax itself has not been fully explored. To improve the efficiency of LayerNorm, some works reduce the input size and others exploit pipelining; however, its sparsity potential has likewise not been explored. To address these challenges, this article presents ULSeq-TA, a software–hardware co-design framework. The software includes 1) a grouped sparse Softmax method, which leverages the data-magnifying characteristic to enable sparse processing in the middle and post-Softmax stages, and 2) a dual-path sparse LayerNorm method, which exploits dimensional significance for sparse calculation. The hardware includes 1) an attention fusion architecture, which reduces on-chip memory through fused operators; 2) the grouped sparse Softmax core; and 3) the dual-path sparse LayerNorm core. Experiments show that the software achieves $4.45\times$ and $7.59\times$ computation reduction for Softmax and LayerNorm, respectively, with little output difference. The hardware architecture supports sequence lengths of up to 32768 with only 186 kB of on-chip memory and achieves speedups of $1.75\times$ to $1.98\times$ for the sparse Softmax core and $3.22\times$ to $4.32\times$ for the sparse LayerNorm core, with little accuracy loss.
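The abstract does not spell out the grouped sparse Softmax rule, so the following is only a minimal sketch of the general idea that exponentiation magnifies gaps between scores, which lets entries far below the maximum be skipped. The function names, the fixed `group_size`, and the `threshold` cutoff are all assumptions made for illustration and are not taken from the paper.

```python
import math


def dense_softmax(scores):
    """Reference softmax with the usual max subtraction for stability."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_sparse_softmax(scores, group_size=4, threshold=8.0):
    """Illustrative only; NOT the paper's algorithm.

    Assumption: an entry more than `threshold` below the global maximum
    contributes less than exp(-threshold) relative to the largest term,
    so it is treated as zero and its exponential is never computed.
    Returns (probabilities, number_of_skipped_entries).
    """
    # Pass 1: split into groups and find each group's maximum.
    groups = [scores[i:i + group_size] for i in range(0, len(scores), group_size)]
    group_maxima = [max(g) for g in groups]
    global_max = max(group_maxima)
    cutoff = global_max - threshold

    # Pass 2: skip whole groups when possible, else skip single entries.
    exps = []
    skipped = 0
    for g, g_max in zip(groups, group_maxima):
        if g_max < cutoff:
            exps.extend([0.0] * len(g))  # entire group is negligible
            skipped += len(g)
            continue
        for x in g:
            if x < cutoff:
                exps.append(0.0)
                skipped += 1
            else:
                exps.append(math.exp(x - global_max))

    total = sum(exps)  # always >= 1.0 because the maximum entry survives
    return [e / total for e in exps], skipped


if __name__ == "__main__":
    scores = [10.0, 9.5, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0]
    probs, skipped = grouped_sparse_softmax(scores)
    print(skipped, "of", len(scores), "exponentials skipped")
```

In this sketch the group-level check lets a whole block be dropped with one comparison, which is the kind of saving that matters for very long rows; how ULSeq-TA actually forms groups and selects entries would need to be taken from the paper itself.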