Abstract: AI models are increasing in size, and recent advances in the community have shown that, unlike HPC applications where double-precision datatypes are required, lower-precision datatypes such as fp8 or int4 are sufficient to deliver the same model quality for both training and inference. Following these trends, GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8 and int8 GeMM operations with exceptional performance via Tensor Cores. This paper, however, proposes a new algorithm called msGeMM, which shows that AI models with low-precision datatypes can run with ~2.5x fewer multiplication and add instructions. An efficient implementation of this algorithm requires special CUDA cores with the ability to add elements from a small look-up table at the rate of Tensor Cores.

What problem does this paper attempt to address?

This paper attempts to improve the performance of low-precision matrix multiplication (GeMM) operations in artificial intelligence (AI) models. Specifically, it proposes a new algorithm, msGeMM (Microsoft GeMM), which uses a look-up table to reduce the number of multiplication and addition instructions required for low-precision data types (such as int4 or fp8) during matrix multiplication, thereby achieving a significant performance improvement.

### Background

As the scale of AI models continues to increase, especially with the success of large language models (LLMs) such as GPT-4 and Llama-2, these models need to store a large number of parameter weights. However, research shows that, unlike high-performance computing (HPC) applications, which rely on the double-precision data type (fp64), AI models can maintain their quality during training and inference with low-precision data types (such as fp8 or int4). Using low-precision data types on GPUs can therefore significantly reduce memory requirements and increase computing speed.

### Core of the Problem

The core problem of the paper is how to further optimize the matrix multiplication (GeMM) operation by taking advantage of the characteristics of low-precision data types to reduce the amount of computation and improve performance. Traditional GeMM operations still require a large number of multiplication and addition instructions when dealing with low-precision data. The paper proposes that constructing a look-up table can significantly reduce the number of these instructions.

### Specific Method

The paper proposes the msGeMM algorithm, which is divided into two stages:

1. **Generate Look-up Table**: Because a low-precision data type can take only a small number of distinct values, pre-calculate and store all possible linear combination results to form a look-up table.
2. **Use Look-up Table**: During the actual calculation, fetch the pre-calculated results from the look-up table instead of recalculating each element.

### Performance Improvement

The paper shows that, for GeMM operations in modern AI models, the msGeMM algorithm can reduce the number of required multiplication and addition instructions by a factor of about 2.5. Specifically, for the two main GeMM operations (MLP1 and MLP2) in the GPT-3 model, a performance improvement of about 2.5x can be achieved when the depth \(d\) of the look-up table is 3.

### Hardware Support

Although the msGeMM algorithm has significant theoretical performance advantages, current GPU hardware (such as the NVIDIA A100) has limitations in implementing it. In particular, the second-stage calculation (the use of the look-up table) cannot fully utilize the high performance of Tensor Cores. The paper therefore suggests adding special CUDA cores to next-generation GPUs so that the look-up stage can reach a performance level similar to that of Tensor Cores.

### Conclusion

In conclusion, this paper proposes a new algorithm, msGeMM, which significantly improves the performance of GeMM operations in AI models by taking advantage of the characteristics of low-precision data types. However, fully exploiting the advantages of this algorithm also requires hardware-level support.
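
The two-stage look-up-table idea described above can be illustrated with a small pure-Python matrix-vector product. This is only a minimal sketch under stated assumptions, not the paper's implementation: it assumes the weights are stored as integer codes into a small set of representable values (here the 16 values of a signed int4), that the table depth `d` means "number of weights covered by one lookup", and that the input length is a multiple of `d`. The function names `build_tables`, `lut_matvec`, and `reference_matvec` are hypothetical and chosen for illustration.

```python
from itertools import product


def build_tables(x, values, d):
    """Stage 1: for each chunk of d inputs, precompute every possible
    linear combination of that chunk with d weights drawn from `values`."""
    assert len(x) % d == 0, "sketch assumes len(x) is a multiple of d"
    tables = []
    for start in range(0, len(x), d):
        chunk = x[start:start + d]
        table = {}
        # One entry per tuple of d weight codes: len(values) ** d entries.
        for codes in product(range(len(values)), repeat=d):
            table[codes] = sum(values[c] * xi for c, xi in zip(codes, chunk))
        tables.append(table)
    return tables


def lut_matvec(w_codes, x, values, d):
    """Stage 2: compute y = W @ x with one lookup and one add per chunk,
    instead of d multiplies and d adds per chunk."""
    tables = build_tables(x, values, d)
    y = []
    for row in w_codes:
        acc = 0
        for k, table in enumerate(tables):
            acc += table[tuple(row[k * d:(k + 1) * d])]
        y.append(acc)
    return y


def reference_matvec(w_codes, x, values):
    """Plain multiply-add matrix-vector product, used only as a check."""
    return [sum(values[c] * xi for c, xi in zip(row, x)) for row in w_codes]


if __name__ == "__main__":
    int4_values = list(range(-8, 8))       # assumed signed int4 value set
    x = [1.0, 2.0, -0.5, 4.0]              # higher-precision inputs
    w_codes = [[0, 15, 8, 3], [7, 7, 7, 7]]  # weight codes, one row per output
    print(lut_matvec(w_codes, x, int4_values, d=2))
    print(reference_matvec(w_codes, x, int4_values))
```

In this sketch the table for each chunk is built once and then reused by every row of the weight matrix, so the cost of the first stage is amortized when the matrix has many rows. The table size grows as `len(values) ** d`, which is why the depth has to stay small.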

Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM
