Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM

Saeed Maleki
2023-10-10
Abstract: AI models are increasing in size, and recent advances in the community have shown that, unlike HPC applications where double-precision datatypes are required, lower-precision datatypes such as fp8 or int4 are sufficient to deliver the same model quality for both training and inference. Following these trends, GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8 and int8 GeMM operations with exceptional performance via Tensor Cores. However, this paper proposes a new algorithm called msGeMM which shows that AI models with low-precision datatypes can run with ~2.5x fewer multiplication and add instructions. Efficient implementation of this algorithm requires special CUDA cores with the ability to add elements from a small look-up table at the rate of Tensor Cores.
Performance; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is improving the performance of low-precision matrix multiplication (GeMM) operations in artificial intelligence (AI) models. Specifically, the paper proposes a new algorithm, msGeMM (Microsoft GeMM), which uses a look-up table to reduce the number of multiplication and addition instructions required for low-precision data types (such as int4 or fp8) during matrix multiplication, thereby achieving a significant performance improvement.

### Background

As the scale of AI models continues to increase, especially with the success of large language models (LLMs) such as GPT-4 and Llama-2, these models need to store a large number of parameter weights. However, research shows that, unlike high-performance computing (HPC) applications, which use the double-precision data type (fp64), AI models can maintain their quality during training and inference with low-precision data types (such as fp8 or int4). Using low-precision data types on GPUs can therefore significantly reduce memory requirements and increase computing speed.

### Core of the Problem

The core problem of the paper is how to further optimize the matrix multiplication (GeMM) operation by taking advantage of the characteristics of low-precision data types to reduce the amount of computation and improve performance. Traditional GeMM operations still require a large number of multiplication and addition instructions when dealing with low-precision data. The paper proposes that, by constructing a look-up table, the number of these instructions can be significantly reduced.

### Specific Method

The paper proposes the msGeMM algorithm, which is divided into two stages:

1. **Generate Look-up Table**: According to the characteristics of low-precision data types, pre-calculate and store all possible linear combination results to form a look-up table.
2. **Use Look-up Table**: During actual calculation, directly obtain the pre-calculated results through the look-up table instead of recalculating each element.

### Performance Improvement

In this way, the paper shows that for GeMM operations in modern AI models, the msGeMM algorithm can reduce the number of required multiplication and addition instructions by a factor of about 2.5. Specifically, for the two main GeMM operations (MLP1 and MLP2) in the GPT-3 model, a performance improvement of about 2.5 times can be achieved when the depth \(d\) of the look-up table is 3.

### Hardware Support

Although the msGeMM algorithm has significant theoretical performance advantages, current GPU hardware (such as the NVIDIA A100) has limitations in implementing this algorithm. In particular, the second-stage calculation (i.e., the use of the look-up table) cannot fully utilize the high performance of Tensor Cores. Therefore, the paper suggests adding special CUDA cores to next-generation GPUs so that the look-up stage can reach a performance level similar to that of Tensor Cores.

### Conclusion

In conclusion, this paper proposes a new algorithm, msGeMM, which significantly improves the performance of GeMM operations in AI models by taking advantage of the characteristics of low-precision data types. However, to fully exploit the advantages of this algorithm, hardware-level support is also required.
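
The two stages described under Specific Method can be illustrated with a small sketch. This is not the paper's implementation: it is plain Python, it computes a matrix-vector product rather than a full GeMM, and the function names (`build_lut`, `lut_matvec`, `naive_matvec`) as well as the choice of signed int4 weights in the range -8..7 are assumptions made for illustration. For each chunk of `d` input elements, stage 1 stores every possible linear combination of that chunk with low-precision weights; stage 2 then replaces `d` multiply-adds per row with one table look-up and one add.

```python
from itertools import product

# Assumption: weights are signed int4 values, i.e. integers in [-8, 7].
INT4_VALUES = range(-8, 8)


def build_lut(x_chunk, values=INT4_VALUES):
    """Stage 1: precompute sum(w_i * x_i) for every possible weight tuple.

    For d = len(x_chunk) and 16 possible weight values, the table
    holds 16**d entries, keyed by the weight tuple itself.
    """
    return {
        ws: sum(w * x for w, x in zip(ws, x_chunk))
        for ws in product(values, repeat=len(x_chunk))
    }


def lut_matvec(W, x, d, values=INT4_VALUES):
    """Stage 2: compute y = W @ x using one look-up and one add per chunk."""
    if len(x) % d != 0:
        raise ValueError("len(x) must be a multiple of d")
    y = [0] * len(W)
    for start in range(0, len(x), d):
        # One table per chunk of x, shared by every row of W.
        lut = build_lut(x[start:start + d], values)
        for r, row in enumerate(W):
            y[r] += lut[tuple(row[start:start + d])]
    return y


def naive_matvec(W, x):
    """Reference: one multiply and one add per weight."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical example data: 2 rows of int4 weights, 6 inputs, depth d = 3.
W = [[1, -8, 7, 0, 3, -2],
     [0, 0, 5, -1, 2, 7]]
x = [2, 3, -1, 4, 5, 6]
print(lut_matvec(W, x, d=3))  # same values as naive_matvec(W, x)
```

In this sketch each table has 16^3 = 4096 entries for `d = 3`, so building it only pays off when the table is reused across many rows of `W`; the two-row example here is far too small to show a saving and serves only to check correctness against the naive product.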