Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

Yikan Qiu, Yufei Ma, Meng Wu, Yifan Jia, Xinyu Qu, Zecheng Zhou, Jincheng Lou, Tianyu Jia, Le Ye, Ru Huang
DOI: https://doi.org/10.1109/cicc60959.2024.10529063
2024-01-01
Abstract: Although the core operations in various AI models can be formulated as matrix multiplication (MM), their characteristics are quite different (Fig. 1). The Q-K-V generation in the transformer [1], the combination phase in the graph convolutional network (GCN) [2], and the convolution layers in the CNN [4] involve static MM with constant weights, which compute-in-memory (CIM) can exploit to eliminate costly data movements. However, the dominant MM of attention in the transformer, aggregation in the GCN, and graph construction in the vision GNN (ViG) [6] is dynamic, in that neither input is constant, which degrades the benefits brought by CIM. In addition, the varying sparsity of MM in different operators typically demands different zero-skipping granularity, leading to different hardware overheads. Therefore, a domain-specific AI accelerator faces three main challenges: 1) customizing the design for numerous and ever-changing AI operators or models leads to excessive and divergent hardware modules, limiting flexibility and overall utilization [4]; 2) a unified computing array based on CIM cannot efficiently process MM with varying sparsity, scale, and data formats; 3) the massive data movements between adjacent operators cause frequent and intensive off-chip memory accesses, resulting in high latency and energy consumption.
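The distinction between static and dynamic MM drawn in the abstract can be illustrated with a minimal sketch of transformer attention scores. This is only an illustration under stated assumptions: the helper names (`matmul`, `transpose`, `attention_scores`), the weight values, and the 2x2 shapes are hypothetical and do not come from the paper, and the code models the data dependence only, not the Quartet hardware.

```python
# Illustrative sketch only: names, values, and shapes are assumptions,
# not taken from the paper.

def matmul(a, b):
    """Plain dense matrix multiplication on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def transpose(m):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]

# Hypothetical projection weights. They are fixed after training, so a
# CIM array could keep them in place while only activations stream in.
W_Q = [[1, 0], [0, 1]]
W_K = [[0, 1], [1, 0]]

def attention_scores(x):
    q = matmul(x, W_Q)               # static MM: W_Q is constant
    k = matmul(x, W_K)               # static MM: W_K is constant
    return matmul(q, transpose(k))   # dynamic MM: both operands depend on x

x = [[1, 2], [3, 4]]
scores = attention_scores(x)
print(scores)  # [[4, 10], [10, 24]]
```

In the last product neither operand exists until the input arrives, so there is no constant matrix to keep resident in memory, which is the reason the abstract gives for CIM being less effective on dynamic MM.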