T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang
2024-06-25
Abstract: The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the critical issues encountered when deploying large language models (LLMs) on edge devices, particularly the problem of mixed-precision General Matrix Multiply (mpGEMM) with low-bit quantized weights during inference. Specifically:

1. **LLM Deployment under Hardware Resource Constraints**:
   - Edge devices such as smartphones and desktop computers typically have limited memory, while LLMs require a large amount of memory to store their parameters.
   - To adapt to these resource-constrained environments, the paper proposes T-MAC, a lookup table (LUT)-based method for efficient, low-power inference.

2. **Support for Mixed-Precision Computation**:
   - Most current hardware architectures (e.g., CPU, GPU) do not directly support mixed-precision computation between low-bit quantized weights and high-precision activations.
   - Existing solutions dequantize the low-precision weights to a higher-precision data type before computing, which adds computational overhead and does not fully exploit the advantages of low-bit weights.

3. **Unified and Scalable Design**:
   - The proposed method handles combinations of weights and activations with different bit widths and replaces multiplications with table lookups, thereby lowering the computational cost.
   - The design of T-MAC allows it to run efficiently on the CPUs of various edge devices without relying on specialized accelerators or GPUs.

Through these improvements, T-MAC can significantly improve the inference speed and energy efficiency of low-bit quantized LLMs on edge devices.
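To make the idea of bit-wise table lookup concrete, the following is a minimal pure-Python sketch of a LUT-based matrix-vector product. It is an illustration only, not T-MAC's actual kernel: the function names (`build_table`, `lut_gemv`), the group size of 4, and the use of unsigned integer weights without scales or zero-points are all assumptions made for this example. Each weight is split into bit-planes; for every group of activations, the sums of all subsets are precomputed once and then reused for every weight row and every bit-plane, so the inner loop contains lookups and additions but no per-weight multiplication.

```python
G = 4  # activations per lookup group (an assumption for this sketch)


def build_table(acts):
    """Precompute the sum of every subset of one activation group.

    table[idx] is the sum of acts[j] over every j whose bit is set in idx.
    """
    table = [0.0] * (1 << len(acts))
    for idx in range(1, len(table)):
        low = idx & -idx              # lowest set bit of idx
        j = low.bit_length() - 1      # position of that bit
        table[idx] = table[idx ^ low] + acts[j]
    return table


def lut_gemv(weight_rows, acts, bits):
    """Multiply unsigned `bits`-bit integer weight rows by float activations."""
    assert len(acts) % G == 0
    # Tables depend only on the activations, so they are built once
    # and shared by every weight row.
    tables = [build_table(acts[s:s + G]) for s in range(0, len(acts), G)]
    out = []
    for row in weight_rows:
        plane_sums = [0.0] * bits     # one accumulator per weight bit-plane
        for g, table in enumerate(tables):
            group = row[g * G:(g + 1) * G]
            for b in range(bits):
                # Pack bit b of each weight in the group into a table index.
                idx = 0
                for j, w in enumerate(group):
                    idx |= ((w >> b) & 1) << j
                plane_sums[b] += table[idx]   # lookup replaces G multiply-adds
        # One scaling per bit-plane, not one multiplication per weight.
        out.append(sum(s * (1 << b) for b, s in enumerate(plane_sums)))
    return out


if __name__ == "__main__":
    W = [[3, 0, 1, 2, 1, 1, 0, 3], [2, 2, 2, 2, 0, 1, 3, 0]]  # 2-bit weights
    x = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5, 0.25, -2.0]
    print(lut_gemv(W, x, bits=2))
```

In this sketch the number of lookups per row grows in proportion to `bits`, which matches the abstract's statement that the LUT-based kernels scale linearly with the weight bit-width. A production kernel would additionally handle signed weights, quantization scales, and vectorized lookups, none of which are modeled here.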