A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing.
Yang Wang,Yubin Qin,Dazheng Deng,Jingchuan Wei,Yang Zhou,Yuanqi Fan,Tianbao Chen,Hao Sun,Leibo Liu,Shaojun Wei,Shouyi Yin
DOI: https://doi.org/10.1109/ISSCC42614.2022.9731686
2022-01-01
Abstract: Recently, Transformer-based models have achieved tremendous success in many AI fields, from NLP to CV, using the attention mechanism [1]–[3]. This mechanism captures the global correlations of the input by scoring the relevance between every pair of tokens with attention scores, and uses the normalized scores, defined as attention probabilities, to weight all input tokens and obtain output tokens with a global receptive field. A Transformer model consists of multiple blocks, named multi-head, working with the attention mechanism. Figure 29.2.1 details the computation of an attention block with the query (Q), key (K), and value (V) matrices, computed from the tokens and weight matrices. First, Q is multiplied by <tex>$\mathrm{K}^{\mathrm{T}}$</tex> to generate the attention score matrix. The scores in each row, represented as <tex>$\mathrm{X}_{\mathrm i}$</tex>, indicate a token's relevance to all others. Second, the row-wise softmax with inputs of <tex>$\mathrm{X}_{\mathrm{i}}-\mathrm{X}_{\max}$</tex> normalizes attention scores to probabilities (P), expanding the large scores and reducing the small scores exponentially. Finally, probabilities are quantized and then multiplied by V to produce the output. Each output token is a weighted sum of all input tokens, where the strongly related tokens have large weight values. Global attention-based models achieve 20.4% higher accuracy than LSTM for NLP and 15.1% higher accuracy than ResNet-152 for classification.
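The three steps described in the abstract (score matrix, row-wise softmax on the max-subtracted scores, quantized probabilities times V) can be sketched in plain Python. This is only an illustrative sketch, not the processor's implementation: the helper names, the toy 2x2 matrices, and the 8-bit uniform quantization of the probabilities are assumptions, since the abstract does not state a bit-width or quantization scheme. The common scaling of the scores by the square root of the key dimension is also omitted because the abstract does not mention it.

```python
import math

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_row(scores):
    """Softmax over one row, using inputs X_i - X_max as in the abstract."""
    x_max = max(scores)
    exps = [math.exp(x - x_max) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def quantize(p, bits=8):
    """Uniform quantization of a probability in [0, 1].

    Assumption: the 8-bit default and the rounding scheme are illustrative,
    not taken from the paper.
    """
    levels = (1 << bits) - 1
    return round(p * levels) / levels

def attention(q, k, v, bits=8):
    """Return (output, probabilities) for one attention block."""
    scores = matmul(q, transpose(k))              # step 1: Q x K^T
    probs = [softmax_row(row) for row in scores]  # step 2: row-wise softmax
    probs_q = [[quantize(p, bits) for p in row] for row in probs]
    output = matmul(probs_q, v)                   # step 3: quantized P x V
    return output, probs

# Toy inputs (hypothetical values chosen only for illustration).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, probs = attention(q, k, v)
```

In this toy case each token scores highest against itself, so each output row is pulled toward the matching row of V while still mixing in the other row, which is the global receptive field the abstract describes.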