16.1 MulTCIM: A 28nm $2.24\,\mu\mathrm{J}$/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers

Fengbin Tu, Zihan Wu, Yiqi Wang, Weiwei Wu, Leibo Liu, Yang Hu, Shaojun Wei, Shouyi Yin
DOI: https://doi.org/10.1109/ISSCC42615.2023.10067842
2023-01-01
Abstract: Human perception is multimodal and can comprehend a mixture of vision, natural language, speech, etc. Multimodal Transformer (MulT, Fig. 16.1.1) models add a cross-modal attention mechanism to vanilla transformers to learn from different modalities, achieving excellent results on multimodal AI tasks such as video question answering and multilingual image retrieval. Transformers require specialized hardware for efficient inference [1]. Prior work demonstrates that a Compute-In-Memory (CIM) accelerator with attention sparsity can efficiently process vanilla transformers [2]. Multimodal signals like video and audio exhibit diverse token significance, providing new opportunities for token sparsity via runtime pruning [3]. Additionally, activation functions like GELU and softmax produce many near-zero values that expose bit sparsity in the most-significant bits (MSBs). Exploiting attention-token-bit hybrid sparsity poses three challenges: 1) For attention sparsity, irregular patterns result in long reuse distances, which force the CIM to hold infrequently used weights and lower CIM utilization. 2) Although token sparsity reduces computation, MulT's cross-modal attention processes tokens from two modalities with different token lengths ($N$) and embedding dimensionality ($d_m$), causing high latency when switching between modalities. 3) At the bit level, since token sparsity reduces value locality, a CIM macro sees more variance in effective bitwidth within the same group of inputs. In a conventional CIM's bit-serial MAC scheme, the computation time of a group is set by its longest effective bitwidth, so a single wide input stalls the entire group.
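
The third challenge can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the unsigned 8-bit activations, and the group contents are all hypothetical, and the model assumes only that a bit-serial MAC spends one cycle per input bit and that every input in a group waits for the widest one.

```python
from typing import List


def effective_bitwidth(x: int) -> int:
    """Bits needed to represent a non-negative integer (0 needs no bits)."""
    return x.bit_length()


def bit_serial_cycles(group: List[int]) -> int:
    """Cycles for one input group: set by the widest input in the group."""
    return max((effective_bitwidth(x) for x in group), default=0)


def idle_bit_slots(group: List[int]) -> int:
    """Bit slots spent feeding leading zeros while waiting for the widest input."""
    cycles = bit_serial_cycles(group)
    return sum(cycles - effective_bitwidth(x) for x in group)


# Hypothetical unsigned 8-bit activations (assumption, for illustration only).
local_group = [3, 2, 1, 3]    # similar magnitudes: value locality preserved
mixed_group = [3, 2, 200, 1]  # one large value among near-zero ones

print(bit_serial_cycles(local_group), idle_bit_slots(local_group))  # 2 1
print(bit_serial_cycles(mixed_group), idle_bit_slots(mixed_group))  # 8 19
```

In this toy model, replacing one small value with a large one raises the group's latency from 2 to 8 cycles even though three of the four inputs still have sparse MSBs, which is the variance in effective bitwidth the abstract refers to.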