Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow
Yubin Qin,Yang Wang,Dazheng Deng,Xiaolong Yang,Zhiren Zhao,Yang Zhou,Yuanqi Fan,Jingchuan Wei,Tianbao Chen,Leibo Liu,Shaojun Wei,Yang Hu,Shouyi Yin
DOI: https://doi.org/10.1109/jssc.2024.3397189
2024-01-01
Abstract: The transformer model has demonstrated outstanding performance in the field of artificial intelligence. However, this performance comes at the cost of substantial computational complexity, which limits the deployment of transformers from cloud to edge under power and throughput constraints. There are two main challenges in designing a transformer accelerator for practical tasks. First, the bottleneck of a transformer shifts with input length: for short inputs, such as using a vision transformer (ViT) for ImageNet or bidirectional encoder representations from transformers (BERT) for general language understanding evaluation (GLUE), the linear layers of the model are the computational bottleneck. In contrast, for long inputs, such as high-resolution images or long-text tasks, attention computation becomes the bottleneck. Second, even for a given input length, different layers in the model exhibit different computational characteristics and workloads, such as matrix sizes and data reuse strategies. This article introduces Ayaka, a versatile transformer accelerator designed to address these issues. Ayaka uses a cross-layer sparse prediction approach based on random projection (RP), enabling simultaneous sparsification of attention computation and linear layers, thereby enhancing throughput regardless of which bottleneck dominates at a given input length. Furthermore, Ayaka optimizes the sparse attention computation by leveraging the input translation invariance of softmax. In addition, Ayaka features a heterogeneous dataflow processing element (HDPE) design, which dynamically adjusts the stationary matrix operands based on the current computation to maximize on-chip data reuse and reduce memory footprint. With these features, Ayaka is, so far, the first accelerator to accelerate the whole attention layer. Evaluation on 12 typical models and tasks shows that it achieves a peak energy efficiency of 49.7 TOPS/W, which is 1.20–258.9$\times$ higher than state-of-the-art works.
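
Two ideas named in the abstract, the translation invariance of softmax and sparse prediction by random projection, can be illustrated in software. The sketch below is a minimal illustration and does not describe Ayaka's hardware or its actual prediction scheme, which the abstract does not detail. The sign-based projection matrix, the top-`keep` selection rule, and all function names are assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax.

    softmax(x + c) == softmax(x) for any constant c, so subtracting the
    maximum score changes nothing except the range of the exponentials.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_projection_matrix(d, r, rng):
    """Return an r x d matrix with entries +-1/sqrt(r) (assumed sign projection)."""
    scale = 1.0 / math.sqrt(r)
    return [[rng.choice((-scale, scale)) for _ in range(d)] for _ in range(r)]


def project(matrix, vec):
    """Map a d-dimensional vector to r dimensions."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention_row(q, keys, proj, keep):
    """Attention probabilities for one query, computed only over predicted keys.

    1. Estimate every q.k score cheaply in the projected r-dimensional space.
    2. Keep the `keep` keys with the largest estimated scores (assumed rule).
    3. Compute exact scaled scores and softmax over the kept keys only;
       pruned keys receive probability 0.
    """
    q_low = project(proj, q)
    estimates = [dot(q_low, project(proj, k)) for k in keys]
    kept = sorted(range(len(keys)), key=lambda i: estimates[i], reverse=True)[:keep]
    exact = [dot(q, keys[i]) / math.sqrt(len(q)) for i in kept]
    probs = [0.0] * len(keys)
    for i, p in zip(kept, softmax(exact)):
        probs[i] = p
    return probs
```

Calling `sparse_attention_row(q, keys, proj, keep=3)` with a 16-dimensional query, eight keys, and a 4 x 16 projection matrix returns eight probabilities, of which three are nonzero and sum to 1. The low-dimensional estimate costs `r` multiplications per key instead of `d`, which is where a prediction step of this kind would save work.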