A 28nm 49.7TOPS/W Sparse Transformer Processor with Random-Projection-Based Speculation, Multi-Stationary Dataflow, and Redundant Partial Product Elimination
Yubin Qin, Yang Wang, Dazheng Deng, Xiaolong Yang, Zhiren Zhao, Yang Zhou, Yuanqi Fan, Jingchuan Wei, Tianbao Chen, Leibo Liu, Shaojun Wei, Yang Hu, Shouyi Yin
DOI: https://doi.org/10.1109/a-sscc58667.2023.10347953
2023-01-01
Abstract: Transformer models have shown remarkable performance in various fields, with significant accuracy improvements over traditional artificial intelligence models [1], [2]. However, their high computational and memory complexity limits deployment on power-constrained edge devices. Dynamic sparse attention (DSA) is a feasible way to improve throughput and energy efficiency: it predicts the significant attention computations during inference and skips the insignificant ones [3], [4]. Although a high sparsity ratio offers the potential for significant energy savings, realizing high energy efficiency via DSA is challenging because of three factors, as shown in Fig. 1. Firstly, traditional processors apply a single-stage sparsity speculation method, which yields little speculation benefit in DSA because of a dilemma: speculating the sparsity of attention relies on computing the entire QK matrix [3], which already accounts for 30-48.9% of the attention block computation. Secondly, DSA introduces dynamic sparse data distributions and matrix sizes for the input/weight/output (IWO), resulting in redundant memory access when computing with a traditional single X-stationary dataflow (X stands for I/W/O) [4]–[6]. Any one of the IWO matrices can be the ideal stationary object for achieving minimum memory access. Finally, the attention scores computed after DSA speculation usually have similar significance and therefore involve several identical partial products and partial sums (PP and PS). These contribute nothing, since adding the same value to every input of softmax does not affect its outputs.
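The softmax property behind the third factor can be checked numerically. The following is a minimal sketch in plain Python, not the processor's elimination scheme: the score values and the constant `shared` (standing in for a partial product or sum common to every score in a row) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)                      # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for one query row (illustrative values).
scores = [2.0, 1.5, 1.75, 2.25]

# Hypothetical term shared by every score in the row, e.g. an identical
# partial product or partial sum that appears in each dot product.
shared = 100.0
shifted = [s + shared for s in scores]

base = softmax(scores)
moved = softmax(shifted)

# Largest element-wise difference between the two softmax outputs.
max_diff = max(abs(a - b) for a, b in zip(base, moved))
print(max_diff)
```

Because the shared term cancels between the numerator and denominator of softmax, `max_diff` stays at the level of floating-point rounding, which is why such a common term can in principle be skipped without changing the attention probabilities.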