T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang

2024-06-25

Abstract: The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. This indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, a lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the number of additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.
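The bit-wise table lookup described in the abstract can be illustrated with a small NumPy sketch. This is not the T-MAC kernel: it assumes unsigned integer weights with no scales or zero-points, and the names `build_luts`, `lut_gemv`, and the group size `G = 4` are hypothetical choices made for illustration. Each n-bit weight matrix is split into n one-bit planes, and for every group of `G` activations all `2**G` possible partial sums are precomputed once, so the product needs only lookups and additions plus one scaling per bit plane.

```python
import numpy as np

G = 4  # hypothetical group size: number of activations covered by one table


def build_luts(x, g=G):
    """For each group of g activations, precompute all 2**g partial sums."""
    groups = x.reshape(-1, g)                    # (K/g, g)
    idx = np.arange(2 ** g)                      # every g-bit pattern
    bits = (idx[:, None] >> np.arange(g)) & 1    # (2**g, g); bit j selects activation j
    return groups @ bits.T                       # (K/g, 2**g)


def lut_gemv(w_q, x, n_bits, g=G):
    """Compute w_q @ x for unsigned n_bits-wide integer weights via table lookup."""
    m, k = w_q.shape
    luts = build_luts(x, g)                      # built once, shared by all rows and bit planes
    pack = 1 << np.arange(g)                     # packs g one-bit weights into a table index
    rows = np.arange(k // g)
    out = np.zeros(m, dtype=np.float64)
    for b in range(n_bits):                      # work grows linearly with the bit width
        plane = (w_q >> b) & 1                   # (m, k) one-bit weight matrix
        index = plane.reshape(m, k // g, g) @ pack   # (m, K/g) table indices
        out += luts[rows, index].sum(axis=1) * (1 << b)
    return out


rng = np.random.default_rng(0)
w_q = rng.integers(0, 4, size=(8, 16))           # 2-bit weights stored as plain integers
x = rng.standard_normal(16)
assert np.allclose(lut_gemv(w_q, x, n_bits=2), w_q @ x)
```

The loop over `n_bits` is where the linear scaling with weight bit-width shows up: a 4-bit matrix reuses the same tables and simply runs the lookup pass four times instead of two.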

Distributed, Parallel, and Cluster Computing; Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the main obstacles to deploying large language models (LLMs) on edge devices, in particular mixed-precision General Matrix Multiply (mpGEMM) between low-bit quantized weights and high-precision activations during inference. Specifically:

1. **LLM Deployment under Hardware Resource Constraints**:
   - Edge devices such as smartphones and desktop computers typically have limited memory, while LLMs require a large amount of memory to store their parameters.
   - To fit these resource-constrained environments, the paper proposes T-MAC, a lookup table (LUT)-based method for efficient, low-power inference.
2. **Support for Mixed-Precision Computation**:
   - Most current hardware architectures (e.g., CPU, GPU) do not directly support mixed-precision computation between low-bit quantized weights and high-precision activation values.
   - Existing solutions dequantize the low-precision weights to a higher-precision data type, which adds computational overhead and does not fully exploit the advantages of low-bit weights.
3. **Unified and Scalable Design**:
   - The proposed method handles weights and activation values of different bit widths and replaces multiplications with table lookups, lowering the computational cost.
   - T-MAC runs efficiently on the CPUs of various edge devices without relying on specialized accelerators or GPUs.

Through these improvements, T-MAC can significantly improve the inference speed and energy efficiency of low-bit quantized LLMs on edge devices without sacrificing computational efficiency.
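For contrast, the dequantization-based baseline mentioned in point 2 can be sketched as follows. This is a simplified per-tensor illustration, not how llama.cpp or any specific system implements it; the function name `dequant_gemv` and the `scale` and `zero_point` parameters are assumptions introduced here.

```python
import numpy as np


def dequant_gemv(w_q, scale, zero_point, x):
    """Baseline: expand low-bit weight codes to float32, then do an ordinary product."""
    # Extra pass over every weight before any useful arithmetic happens.
    w = (w_q.astype(np.float32) - zero_point) * scale
    # Full-precision multiply-adds, regardless of how few bits the codes used.
    return w @ x


w_q = np.array([[0, 3, 1, 2], [2, 2, 0, 1]], dtype=np.uint8)  # 2-bit codes held in uint8
x = np.array([1.0, -2.0, 0.5, 4.0], dtype=np.float32)
y = dequant_gemv(w_q, scale=0.5, zero_point=1.5, x=x)
```

The conversion step and the multiply-adds cost the same whether the codes are 2-bit or 4-bit, which is the sense in which this approach does not fully benefit from lower bit widths.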

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration

In-Memory Multi-Bit Multiplication and Accumulation (MAC) Using FeFET for Energy Efficient IoT

QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Trident-CIM: A LUT-Based Compute-in-Memory Macro with Trident Read Bit-Line and Partial Product Pruning

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

TMA: Tera-MACs/W Neural Hardware Inference Accelerator with a Multiplier-less Massive Parallel Processor

Look-Up Table based Neural Network Hardware

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models