PF‐GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT‐2 inference

Hyeji Kim, Yeongmin Lee, Chun‐Gi Lyuh
DOI: https://doi.org/10.4218/etrij.2024-0111
2024-10-29
ETRI Journal
Abstract: Owing to the widespread advancement of transformer‐based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix–vector multiplication in addition to the conventional matrix–matrix multiplication. However, current AI processor architectures are optimized for general matrix–matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix–vector multiplications (GEMVs). In this study, we propose a port‐folding GEMV (PF‐GEMV) scheme that employs multiformat and low‐precision techniques while reusing an outer product‐based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8‐bit format on an 8 × 8 processor, resulting in a 7.5× increase in throughput compared with that of the original scheme. Furthermore, when applied to the matrix operations of the GPT‐2 large model, a 7× speedup is achieved in single‐batch inference.
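The utilization figures in the abstract can be sanity-checked with a small sketch. It assumes a simplified mapping that the abstract does not spell out: each processing element (PE) of an N × N outer-product array holds one output element, so a GEMM tile fills all N × N PEs while a GEMV tile fills only a single column. The function names `outer_product_gemm` and `pe_utilization` are hypothetical, and the sketch does not model the port-folding mechanism itself.

```python
import numpy as np


def outer_product_gemm(A, B):
    """Compute C = A @ B as a sum of rank-1 outer products, one per step.

    This mirrors the dataflow of an outer-product processor: at step k,
    column k of A and row k of B update every output element at once.
    """
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])
    return C


def pe_utilization(array_rows, array_cols, out_rows, out_cols):
    """Fraction of PEs that hold a useful output element for one tile.

    Assumption (not from the paper): one output element per PE.
    """
    used = min(array_rows, out_rows) * min(array_cols, out_cols)
    return used / (array_rows * array_cols)


N = 8  # 8 x 8 processor, as in the abstract

# GEMM tile: an N x N output block occupies every PE.
gemm_util = pe_utilization(N, N, N, N)

# GEMV tile: the output is a single column, so only N of N*N PEs are busy.
gemv_util = pe_utilization(N, N, N, 1)

# Utilization reported in the abstract for PF-GEMV (8-bit format).
reported_util = 0.937

# Throughput gain implied by the utilization ratio under this assumption.
implied_speedup = reported_util / gemv_util

print(f"GEMM utilization: {gemm_util:.3f}")
print(f"GEMV utilization: {gemv_util:.3f}")
print(f"Implied speedup:  {implied_speedup:.2f}x")
```

Under this assumed mapping, the baseline GEMV utilization is 1/8, and the ratio of the reported 93.7% to that baseline is close to the 7.5× throughput gain stated in the abstract.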
telecommunications; engineering, electrical & electronic