Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition

Adnan Hoque, Less Wright, Chih-Chieh Yang, Mudhakar Srivatsa, Raghu Ganti
2024-02-23
Abstract: We propose an efficient fused matrix multiplication kernel for W4A16 quantized inference, in which dequantization and GEMM are performed in a single fused kernel using a SplitK work decomposition. Our implementation improves performance for the skinny matrix-matrix multiplications found in foundation model inference workloads. In particular, this paper examines matrix multiplication between a skinny activation matrix and a square weight matrix. Our results show an average speed improvement of 65% on A100 and an average speed improvement of 124% on H100 (with a peak of 295%) for a range of matrix dimensions, including those found in a llama-style model, where m < n = k.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
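
To make the idea in the abstract concrete, here is a minimal NumPy sketch of a fused W4A16 matmul with a SplitK-style decomposition. It is a CPU reference for illustration only, not the authors' Triton kernel. The helper names (`pack_int4`, `unpack_dequant`, `splitk_w4a16_matmul`), the packing layout (two 4-bit values per byte along k), and the per-column `scale` and `zero` parameters are all assumptions made for this example; real quantization schemes often use group-wise scales and different layouts.

```python
import numpy as np

def pack_int4(w_q):
    """Pack unsigned 4-bit values of shape (k, n) into uint8 of shape (k // 2, n).

    Assumed layout: even rows go to the low nibble, odd rows to the high nibble.
    """
    assert w_q.shape[0] % 2 == 0
    lo = w_q[0::2].astype(np.uint8)
    hi = w_q[1::2].astype(np.uint8)
    return lo | (hi << 4)

def unpack_dequant(packed, scale, zero):
    """Unpack 4-bit weights and dequantize them as (q - zero) * scale."""
    k_half, n = packed.shape
    w_q = np.empty((2 * k_half, n), dtype=np.uint8)
    w_q[0::2] = packed & 0x0F   # low nibble -> even rows
    w_q[1::2] = packed >> 4     # high nibble -> odd rows
    return (w_q.astype(np.float32) - zero) * scale

def splitk_w4a16_matmul(a, packed, scale, zero, split_k=4):
    """Compute a @ dequant(packed), splitting the k dimension into split_k slices.

    Each loop iteration stands in for one unit of parallel work that
    dequantizes its slice of the weights and produces a partial sum.
    """
    m, k = a.shape
    n = packed.shape[1]
    assert k % (2 * split_k) == 0, "k must split evenly into packed slices"
    chunk = k // split_k
    out = np.zeros((m, n), dtype=np.float32)  # fp32 accumulator
    for s in range(split_k):
        k0, k1 = s * chunk, (s + 1) * chunk
        # Fused step: dequantize only this k-slice, then multiply.
        w_slice = unpack_dequant(packed[k0 // 2 : k1 // 2], scale, zero)
        out += a[:, k0:k1].astype(np.float32) @ w_slice
    return out.astype(np.float16)

# Usage: a skinny fp16 activation matrix (m < n = k) and a square 4-bit weight.
rng = np.random.default_rng(0)
m, k, n = 4, 64, 64
a = rng.standard_normal((m, k)).astype(np.float16)
w_q = rng.integers(0, 16, size=(k, n)).astype(np.uint8)
scale = rng.uniform(0.01, 0.1, size=n).astype(np.float32)
zero = np.full(n, 8, dtype=np.float32)
packed = pack_int4(w_q)
out = splitk_w4a16_matmul(a, packed, scale, zero, split_k=4)
```

On a GPU, each k-slice would typically be assigned to a separate thread block, with the partial results combined through an atomic add or a follow-up reduction. The sequential loop above only mirrors that arithmetic; it does not model the parallelism that makes SplitK attractive when m is small.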
What problem does this paper attempt to address?