Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs

Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan Huang, Yutong Lu
DOI: https://doi.org/10.1145/3617689
IF: 1.444
2023-08-26
ACM Transactions on Architecture and Code Optimization
Abstract: Transformer models have emerged as a leading approach in natural language processing (NLP) and are increasingly deployed in production environments. Graphics processing units (GPUs) have become a popular choice for transformer deployment and often rely on batch processing to ensure high hardware performance. Nonetheless, current practice for transformer inference incurs computational and memory redundancy due to the heavy-tailed distribution of sequence lengths in NLP scenarios, resulting in low practical performance. In this paper, we propose a unified solution for improving both the computation and memory efficiency of real-world transformer inference on GPUs. The solution eliminates redundant computation and memory footprint across a transformer model. First, a GPU-oriented computation approach is proposed to process the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module continues to use the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, the solution organizes the data layout of the self-attention module at block granularity. Since the aforementioned approaches cause the required memory size to shrink substantially and fluctuate constantly, we propose a chunk-based approach to strike a better balance between memory footprint and allocation/free efficiency. Our experimental results show that, compared with prevailing frameworks, our unified solution reduces average latency by 28% on the entire transformer model and by 63.8% on the self-attention module, and reduces the memory footprint of intermediate results by 7.8×.
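The padding redundancy that the abstract attributes to heavy-tailed sequence lengths can be illustrated with a minimal sketch. It assumes that "word-accumulation" means concatenating only the valid tokens of a batch into one flat buffer; the abstract does not spell out the mechanism, so this reading, the function names, and the sequence lengths are all hypothetical and not taken from the paper.

```python
# Illustrative sketch only: names and numbers are assumptions, not the paper's code.

def padded_tokens(lengths):
    """Tokens processed when every sequence is padded to the batch maximum."""
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    """Tokens processed when only valid tokens are kept (no padding)."""
    return sum(lengths)

def pack(batch):
    """Concatenate variable-length sequences into one flat list.

    Returns the flat token list and the start offsets of each sequence,
    so sequence i occupies flat[offsets[i]:offsets[i + 1]].
    """
    flat, offsets = [], [0]
    for seq in batch:
        flat.extend(seq)
        offsets.append(len(flat))
    return flat, offsets

# A made-up heavy-tailed batch: most sequences are short, one is long.
lengths = [4, 6, 5, 7, 128]
redundancy = 1 - packed_tokens(lengths) / padded_tokens(lengths)
print(f"padded={padded_tokens(lengths)} packed={packed_tokens(lengths)} "
      f"redundant fraction={redundancy:.3f}")
```

In this made-up batch, a single long sequence forces every other sequence to be padded to 128 tokens, so most of the padded work is spent on padding. Position-wise layers such as the multi-layer perceptron can run directly on the flat buffer, whereas self-attention needs the per-sequence offsets, which is presumably why the abstract treats the two modules separately.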
computer science, theory & methods, hardware & architecture