MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity

Fengbin Tu, Zihan Wu, Yiqi Wang, Weiwei Wu, Leibo Liu, Yang Hu, Shaojun Wei, Shouyi Yin
DOI: https://doi.org/10.1109/jssc.2023.3305663
IF: 5.4
2023-01-01
IEEE Journal of Solid-State Circuits
Abstract: Multimodal Transformers are emerging artificial intelligence (AI) models that comprehend a mixture of signals from different modalities like vision, natural language, and speech. The attention mechanism and massive matrix multiplications (MMs) cause high latency and energy. Prior work has shown that a digital computing-in-memory (CIM) network can be an efficient architecture to process Transformers while maintaining high accuracy. To further improve energy efficiency, attention-token-bit hybrid sparsity in multimodal Transformers can be exploited. The hybrid sparsity significantly reduces computation, but the irregularity also harms CIM utilization. To fully utilize the attention-token-bit hybrid sparsity of multimodal Transformers, we design a digital CIM-based accelerator called MulTCIM with three corresponding features: The long reuse elimination dynamically reshapes the attention pattern to improve CIM utilization. The runtime token pruner (RTP) removes insignificant tokens, and the modal-adaptive CIM network (MACN) exploits symmetric modal overlapping to reduce CIM idleness. The effective bitwidth-balanced CIM (EBB-CIM) macro balances input bits across in-memory multiply-accumulations (MACs) to reduce computation time. The fabricated MulTCIM consumes only 2.24 μJ/Token for the ViLBERT-base model, achieving 2.50×–5.91× lower energy than previous Transformer accelerators and digital CIM accelerators.
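As a rough software illustration of the token-level sparsity mentioned in the abstract, the sketch below prunes tokens that receive little attention. The importance criterion (total attention a token receives across all queries), the `keep_ratio` parameter, and the function name `prune_tokens` are assumptions made for illustration only; the abstract does not specify how the RTP scores tokens, and this is not the hardware implementation.

```python
def prune_tokens(attention, keep_ratio):
    """Return the indices of tokens to keep, in their original order.

    attention[i][j] is the attention weight that query token i assigns
    to key token j (each row is assumed to sum to 1).
    """
    n = len(attention)
    # Assumed importance score: total attention received by token j.
    importance = [sum(row[j] for row in attention) for j in range(n)]
    # Keep at least one token so later layers still have input.
    keep = max(1, round(n * keep_ratio))
    # Rank tokens from most to least important, then restore original order.
    ranked = sorted(range(n), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical 4-token attention matrix: tokens 0 and 2 dominate.
attention = [
    [0.70, 0.05, 0.20, 0.05],
    [0.60, 0.10, 0.25, 0.05],
    [0.50, 0.05, 0.40, 0.05],
    [0.55, 0.05, 0.30, 0.10],
]
kept = prune_tokens(attention, keep_ratio=0.5)
print(kept)  # [0, 2]
```

Dropping tokens 1 and 3 here halves the rows and columns of every subsequent matrix multiplication, which is the kind of computation reduction the abstract attributes to token sparsity; the resulting irregular workload is what the accelerator's scheduling features are described as addressing.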
engineering, electrical & electronic