16.2 A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine

Shiwei Liu, Peizhe Li, Jinshan Zhang, Yunzhengmao Wang, Haozhe Zhu, Wenning Jiang, Shan Tang, Chixiao Chen, Qi Liu, Ming Liu
DOI: https://doi.org/10.1109/isscc42615.2023.10067360
2023-01-01
Abstract: Transformer networks, from BERT and GPT to AlphaFold, have demonstrated unprecedented advances in a variety of AI tasks. Fig. 16.2.1 shows the computing flow of self-attention, the fundamental operation in transformers. Queries ($Q$), keys ($K$), and values ($V$) are first obtained by multiplying the inputs with three weight matrices. Afterward, scores that evaluate $Q$-$K$ relevance are computed as scaled dot products and converted to probabilities through the softmax function. The probabilities are then multiplied by $V$ to generate the final self-attention results. Transformer networks have led to an explosion in parameter counts, for example, 175B parameters for GPT-3. This demands significant growth in computing hardware and memory. Owing to expanding network sizes and the corresponding power consumption, compute-in-memory (CIM) block-wise sparsity-aware architectures were proposed for matrix multiplication [1] and local attention [2] accelerators, where weight storage and compute are skipped for zero-value blocks. Yet, such structured sparsity comes at the cost of notable accuracy loss [3]. Consequently, a challenge for CIM-based accelerators is how to handle unstructured-pruned NNs while maintaining high efficiency. These unstructured patterns can be represented as: 1) irregularly distributed zero weights inside matrices, and 2) varied local attention spans for different attention heads.
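The self-attention flow described in the abstract can be sketched in software as follows. This is a minimal, plain-Python reference under stated assumptions, not the accelerator's hardware datapath: the function names (`matmul`, `softmax`, `self_attention`) and the `span` parameter are hypothetical and introduced only for illustration. The optional `span` models a local attention window in which each token attends only to positions within `span` steps of itself; the abstract notes that such spans vary across attention heads, but the exact windowing rule used here is an assumption.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v, span=None):
    """Single-head self-attention; returns (output, probabilities).

    x: input matrix (tokens x features); w_q, w_k, w_v: weight matrices.
    span: hypothetical local-attention window (assumption, not from the paper).
    """
    # Step 1: obtain Q, K, V by multiplying the inputs with three weight matrices.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)

    # Step 2: scaled dot-product scores measuring Q-K relevance.
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]

    # Optional local attention: mask out positions farther than `span` away.
    if span is not None:
        scores = [
            [s if abs(i - j) <= span else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)
        ]

    # Step 3: convert scores to probabilities with softmax, row by row.
    probs = [softmax(row) for row in scores]

    # Step 4: multiply the probabilities by V to get the attention output.
    return matmul(probs, v), probs
```

Masked positions receive a probability of exactly zero, so a hardware implementation could in principle skip the corresponding score and value computations altogether; this sketch still computes them and only zeroes them afterward.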