TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes

Fengbin Tu, Zihan Wu, Yiqi Wang, Ling Liang, Liu Liu, Yufei Ding, Leibo Liu, Shaojun Wei, Yuan Xie, Shouyi Yin
DOI: https://doi.org/10.1109/jssc.2022.3213542
IF: 5.4
2022-01-01
IEEE Journal of Solid-State Circuits
Abstract: Transformer models achieve excellent results in fields such as natural language processing, computer vision, and bioinformatics. Their large numbers of matrix multiplications (MMs) lead to substantial data movement and computation. Although computing-in-memory (CIM) has proven to be an efficient architecture for MM computation, the transformer's attention mechanism raises new challenges in both memory access and computation: the dynamic MM in attention layers causes redundant off-chip memory access, and attention layers dominate the transformer's computation and require high precision. Thus, we design TranCIM, a bitline-transpose CIM-based transformer accelerator with pipeline/parallel reconfigurable modes. The pipeline mode alleviates off-chip access for attention layers. The parallel mode is used by fully-connected (FC) layers for high parallelism. The full-digital CIM supports INT16 for attention layers and INT8 for FC layers, without analog CIM's nonideal issues. Moreover, a sparse attention scheduler (SAS) is proposed to reduce attention computation. The fabricated TranCIM chip consumes only 15.59 $\mu\text{J}$/Token for the bidirectional encoder representations from transformers (BERT)-base model, achieving $12.08\times$–$36.82\times$ lower energy than prior CIM-based accelerators.
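As a rough reading aid for the energy claim, the sketch below back-computes the per-token energy that the quoted reduction factors would imply for the prior accelerators. It assumes that "N× lower energy" means prior energy divided by TranCIM energy equals N on the same BERT-base workload; that interpretation, the function name, and the constant names are illustrative assumptions, not figures or terms from the paper.

```python
# Figures quoted in the abstract.
TRANCIM_UJ_PER_TOKEN = 15.59          # microjoules per token, BERT-base
REDUCTION_FACTORS = (12.08, 36.82)    # "lower energy" range vs. prior CIM work


def implied_baseline_energy(own_uj: float, factor: float) -> float:
    """Energy a baseline would need for `own_uj` to be `factor` times lower.

    Assumption: factor = baseline energy / own energy.
    """
    return own_uj * factor


# Implied prior-work energies in microjoules per token (derived, not reported).
baselines = [
    round(implied_baseline_energy(TRANCIM_UJ_PER_TOKEN, f), 2)
    for f in REDUCTION_FACTORS
]
print(baselines)
```

Under that assumption the prior designs would fall at roughly 188 to 574 $\mu\text{J}$/Token; these are derived values, not measurements reported in the abstract.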
engineering, electrical & electronic
What problem does this paper attempt to address?