A 28nm 4.35TOPS/mm2 Transformer Accelerator with Basis-vector Based Ultra Storage Compression, Decomposed Computation and Unified LUT-Assisted Cores

Chen Tang, Xinyuan Lin, Zongle Huang, Wenyu Sun, Hongyang Jia, Yongpan Liu
DOI: https://doi.org/10.1109/vlsitechnologyandcir46783.2024.10631311
2024-01-01
Abstract: An area-efficient Transformer accelerator exploiting matrix redundancy is presented with four features: 1) a basis-vector decomposition that reduces model storage by 25.5x for Transformers such as BERT-Base, allowing full on-chip inference on devices with about 13MB of memory, such as smartphones, at only 1.28% accuracy loss; 2) an area-efficient self-programming LUT-assisted computing cell using result prefetch; 3) a unified task-insensitive core supporting fast decomposed computing, resulting in a 73% energy saving; 4) a NoC design facilitating hybrid data reuse to reduce communication. It achieves 4.35 $\text{TOPS}/\text{mm}^{2}$ dense area efficiency, 4 times that of the state-of-the-art counterpart at the same fabrication level. It also demonstrates 213%-429% higher overall energy efficiency.
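The abstract does not spell out the decomposition itself, so the following is only a minimal sketch of the general idea, assuming a plain factorization of a weight matrix into a small set of shared basis vectors and per-row combination coefficients. The names `storage_ratio` and `decomposed_matvec`, the toy matrices, and the basis count `k` are hypothetical and are not taken from the paper. It illustrates why storing the two factors can be smaller than storing the dense matrix, and how a matrix-vector product can be computed on the factors without rebuilding the dense weights.

```python
# Illustrative sketch only: a generic basis-vector factorization W = coeffs @ basis.
# This is an assumed stand-in, not the paper's actual compression scheme.

def storage_ratio(d_out, d_in, k):
    """Dense element count divided by the element count of the two factors."""
    return (d_out * d_in) / (k * (d_out + d_in))

def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Plain matrix-matrix product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def decomposed_matvec(coeffs, basis, x):
    """Compute (coeffs @ basis) @ x as coeffs @ (basis @ x)."""
    projected = matvec(basis, x)      # k dot products of length d_in
    return matvec(coeffs, projected)  # d_out dot products of length k

# Toy factors (hypothetical values): k = 2 basis vectors, d_in = 4, d_out = 6.
basis = [[1, 0, 2, -1],
         [0, 1, 1, 3]]
coeffs = [[2, 1], [0, -1], [1, 1], [3, 0], [-2, 2], [1, -1]]
x = [1, 2, 3, 4]

dense = matmul(coeffs, basis)            # reconstructed dense weights
y_dense = matvec(dense, x)               # reference result
y_fact = decomposed_matvec(coeffs, basis, x)

print(y_fact == y_dense)                 # both paths give the same output
print(storage_ratio(768, 768, 16))       # 768 is BERT-Base's hidden size; k = 16 is an assumed value
```

In this sketch the factored path needs `k * (d_in + d_out)` multiply-accumulates instead of `d_out * d_in`, which is the sense in which computing on the factors can be cheaper than dense computation. The printed ratio is specific to the assumed `k` and should not be read as reproducing the reported 25.5x figure.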