DEQ: Dynamic Element-wise Quantization for Efficient Attention Architecture

Xuhang Wang, Zhuoran Song, Qiyue Huang, Xiaoyao Liang
DOI: https://doi.org/10.1109/iccd58817.2023.00098
2023-01-01
Abstract: Attention-based models, such as transformers, have achieved remarkable success across various tasks. However, their deployment is hindered by high memory requirements, long inference latency, and significant power consumption. Quantization has emerged as an effective way to address these challenges by reducing the bit-width of the model. However, existing quantization algorithms either use overly coarse quantization granularity or determine the bit-width of tokens statically, and so lack the flexibility needed to achieve maximum performance improvement. Accordingly, in this paper, we present a Dynamic Element-wise Quantization (DEQ) algorithm that dynamically tunes tokens’ bit-width according to the importance of elements in the attention probability matrix. On the hardware side, we design three versions of the DEQ architecture that progressively improve the performance of the DEQ algorithm. The proposed DEQ architecture addresses the under-utilization and workload imbalance problems by 1) supporting multiple-precision computation on a single systolic array for generality, 2) decoupling the rows of the systolic array for sufficient flexibility, and 3) identifying and parallelizing the independent computations within one systolic array for high parallelism. Extensive experimental results demonstrate that DEQ achieves satisfactory speedups and energy savings compared to state-of-the-art designs.
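The abstract does not spell out how element importance is mapped to a bit-width, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical rule: an element of the attention probability matrix at or above a fixed threshold has its matching value vector quantized to 8 bits, and every other element uses 4 bits. The threshold, the two bit-widths, the symmetric uniform quantizer, and all function names are illustrative assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(x, bits, max_abs):
    """Symmetric uniform quantization to a signed `bits`-bit grid, then dequantize."""
    if max_abs == 0:
        return 0.0
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    q = max(-levels, min(levels, round(x / scale)))
    return q * scale


def choose_bits(prob, threshold=0.1, low_bits=4, high_bits=8):
    """Hypothetical importance rule (an assumption, not taken from the paper)."""
    return high_bits if prob >= threshold else low_bits


def mixed_precision_attention_row(scores, values, threshold=0.1):
    """Compute one attention output row with per-element bit-widths.

    scores: raw attention scores of one query against every key.
    values: one value vector per key.
    """
    probs = softmax(scores)
    bits = [choose_bits(p, threshold) for p in probs]
    out = [0.0] * len(values[0])
    for p, b, v in zip(probs, bits, values):
        max_abs = max(abs(x) for x in v)
        v_q = [quantize(x, b, max_abs) for x in v]  # precision chosen at run time
        for k, x in enumerate(v_q):
            out[k] += p * x
    return probs, bits, out
```

Because the bit-width is picked from the probabilities computed at run time, a different input can assign high precision to different elements, which is the flexibility that a static per-token assignment lacks.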