Abstract: Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that scale and zero point are shared within each quantization group, we propose performing dequantization after per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE utilizes efficient shift&add operations for multiplication, optimizing both computation and energy efficiency. Our experimental results demonstrate that MixPE surpasses the state-of-the-art quantization accelerators by $2.6\times$ speedup and $1.4\times$ energy reduction.

What problem does this paper attempt to address?

The problem this paper addresses is the high computational and memory cost of large language models (LLMs) during inference. Although quantization can significantly reduce these costs, existing hardware accelerators such as GPUs and TPUs lack efficient support for mixed-precision matrix multiplication (mpGEMM). Weights must therefore be dequantized from low precision to high precision inside the main loop, which becomes a performance bottleneck. To address this, the paper proposes a dedicated mixed-precision processing element named MixPE, which aims to improve the inference efficiency of low-bit quantized LLMs by minimizing dequantization overhead.

### Main contributions of the paper:

1. **Propose MixPE**: A new hardware accelerator that handles mixed-precision matrix multiplication efficiently by co-optimizing the quantization scheme and the processing element design. It postpones dequantization until after each per-group mpGEMM, which significantly reduces dequantization overhead.
2. **Design space exploration framework**: A parameterized design space exploration (DSE) framework for evaluating the performance of different GEMM accelerators and determining the best trade-off between numerical precision and hardware efficiency.
3. **Experimental results**: Compared with state-of-the-art quantization accelerators, MixPE achieves a 2.6-fold speedup and a 1.4-fold energy reduction under W4A8 quantization. Under W4A16 quantization, it achieves a 2.44-fold speedup and reduces energy consumption by 68% compared with the traditional FP16 multiplication PE.

### Key technical points:

- **Mixed-precision matrix multiplication (mpGEMM)**: In LLM inference, weights are usually quantized to a lower precision (such as INT4), while activations keep a higher precision (such as INT8 or FP16). MixPE improves computational efficiency by performing the mixed-precision multiplication directly and optimizing the dequantization process.
- **Dequantization optimization**: MixPE exploits the fact that the scale factor and zero point are shared within each quantization group. It dequantizes once after each per-group mpGEMM, reducing the frequency and overhead of dequantization operations.
- **Shift and add operations**: MixPE replaces traditional multipliers with efficient shift and add operations, especially for INT4 and INT8 multiplications. Implementing multiplication through shifts significantly reduces power consumption and increases throughput.

### Experimental verification:

- **Hardware implementation**: MixPE was written in Verilog RTL and implemented on the Xilinx Zynq UltraScale+ ZCU104 evaluation board. Resource utilization and static and dynamic power consumption were evaluated.
- **Performance comparison**: MixPE was compared with baselines such as the traditional INT8 PE, the FP16 PE, BitFusion, and OLAccel. The results show that MixPE has significant advantages in speed and energy efficiency.

In conclusion, by introducing MixPE, this paper addresses the inefficiency of existing hardware in handling mixed-precision matrix multiplication and provides a new solution for efficient inference of large language models.
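
The two ideas summarized above, dequantizing once per quantization group and multiplying with shifts and adds, can be illustrated with a small NumPy sketch. This is not the paper's hardware design or reference code: the function names `grouped_mpgemm` and `shift_add_mul`, the asymmetric convention `w = scale * (q - zero)`, the unsigned 4-bit weight codes, and the omission of activation scaling are all assumptions made for illustration.

```python
import numpy as np


def shift_add_mul(activation: int, weight: int) -> int:
    """Multiply an integer activation by a signed 4-bit weight using only
    shifts and adds (illustrative; not the paper's circuit)."""
    assert -8 <= weight <= 7, "weight must fit in signed INT4"
    magnitude = abs(weight)
    acc = 0
    for bit in range(4):  # |weight| <= 8 fits in 4 bits
        if (magnitude >> bit) & 1:
            acc += activation << bit  # add activation * 2**bit
    return -acc if weight < 0 else acc


def grouped_mpgemm(x_q, w_q, scales, zeros, group_size):
    """Compute x_q @ W where W = scale * (w_q - zero), group by group.

    x_q:    (M, K) integer activations (e.g. INT8)
    w_q:    (K, N) unsigned 4-bit weight codes in [0, 15] (assumed layout)
    scales: (K // group_size, N) per-group scale factors
    zeros:  (K // group_size, N) per-group zero points

    Because scale and zero point are constant within a group, the inner
    products stay in integer arithmetic and dequantization is applied once
    per group instead of once per weight.
    """
    m, k = x_q.shape
    n = w_q.shape[1]
    out = np.zeros((m, n), dtype=np.float64)
    for g in range(k // group_size):
        ks = slice(g * group_size, (g + 1) * group_size)
        xg = x_q[:, ks].astype(np.int64)
        wg = w_q[ks, :].astype(np.int64)
        acc = xg @ wg                         # integer partial sums, (M, N)
        x_sum = xg.sum(axis=1, keepdims=True)  # (M, 1), for the zero-point term
        # Single dequantization step for the whole group:
        # sum_k x_k * s * (q_k - z) = s * (sum_k x_k * q_k - z * sum_k x_k)
        out += scales[g] * (acc - x_sum * zeros[g])
    return out
```

The result of `grouped_mpgemm` should match dequantizing every weight first and then running a floating-point matrix multiplication, while applying the scale and zero point only once per group and output column.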

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Progressive Mixed-Precision Decoding for Efficient LLM Inference

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models

Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

MixMixQ: Quantization with Mixed Bit-Sparsity and Mixed Bit-Width for CIM Accelerators

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Channel-Wise Mixed-Precision Quantization for Large Language Models

A Speed Odyssey for Deployable Quantization of LLMs

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

SqueezeLLM: Dense-and-Sparse Quantization