A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes

Fengbin Tu, Zihan Wu, Yiqi Wang, Ling Liang, Liu Liu, Yufei Ding, Leibo Liu, Shaojun Wei, Yuan Xie, Shouyi Yin
DOI: https://doi.org/10.1109/ISSCC42614.2022.9731645
2022-01-01
Abstract: Transformer models have achieved state-of-the-art results in many fields, such as natural language processing and computer vision, but their large number of matrix multiplications (MM) results in substantial data movement and computation, causing high latency and energy consumption. In recent years, computing-in-memory (CIM) has been demonstrated as an efficient MM architecture, but a Transformer's attention mechanism raises new challenges for CIM in both memory access and computation (Fig. 29.3.1): 1a) Unlike conventional static MM with pre-trained weights, the attention layers introduce dynamic MM (QK^T, A'V), whose weights and inputs are both generated at runtime, leading to redundant off-chip memory access for intermediate data. 1b) A CIM pipeline architecture can mitigate this problem, but it introduces a new challenge: since the K generation direction does not match the conventional CIM write direction, the QK^T pipeline needs a large transpose buffer with extra overhead. 2) Compared with fully connected (FC) layers, attention layers dominate a Transformer's computation and require >8b precision to maintain accuracy, so previous analog CIMs [1]–[2] with ≤8b precision support cannot be used directly. Reducing the amount of computation for attention layers is critical for improving efficiency.
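The distinction between static and dynamic MM in the abstract can be illustrated with a minimal pure-Python sketch of one attention head. This is not the accelerator's dataflow; it is a plain software illustration under stated assumptions: the names `attention`, `matmul`, `transpose`, and `softmax_rows` are hypothetical helpers, A' is assumed to be the softmax-normalized score matrix, and the usual scaling factor and any sparsity handling are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(x, w_q, w_k, w_v):
    """Hypothetical single-head attention; returns (QK^T, A'V)."""
    # Static MM: w_q, w_k, w_v are pre-trained and known before inference.
    q = matmul(x, w_q)
    k = matmul(x, w_k)  # K gains one row per input token
    v = matmul(x, w_v)
    # Dynamic MM: both operands (Q and K^T) exist only at runtime,
    # and K must be used in transposed form.
    scores = matmul(q, transpose(k))
    # Assumption: A' is the softmax-normalized score matrix.
    a_norm = softmax_rows(scores)
    # Dynamic MM again: A' and V are both runtime values.
    return scores, matmul(a_norm, v)


# Two tokens with two features each; all values are illustrative only.
x = [[1, 0], [0, 1]]
scores, out = attention(x, [[1, 2], [3, 4]], [[0, 1], [1, 0]], [[1, 0], [0, 1]])
```

In this sketch K is produced row by row, one token at a time, while the QK^T product consumes it transposed. That mismatch is the software analogue of the transpose-buffer problem described in point 1b.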