Abstract: Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory needs pose obstacles for broad use on edge devices. Quantization is then introduced to boost LLMs' on-device efficiency. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, while the activation is still not quantized. On the other hand, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models (LLMs), and implement an end-to-end accelerator on multiple edge devices for faster inference. Considering the hardware profiling and activation analysis, we first introduce a basic activation quantization strategy to balance the trade-off of task performance and real inference speed. Then we leverage the activation-aware token pruning technique to reduce the outliers and the adverse impact on attentivity. Ultimately, we utilize the SIMD-based 4-bit multiplier and our efficient TRIP matrix multiplication to implement the accelerator for LLMs on the edge. We apply our framework on different scales of LLMs including LLaMA, OPT, and BLOOM with 4-bit or 8-bit for the activation and 4-bit for the weight quantization. Experiments show that Agile-Quant achieves simultaneous quantization of model weights and activations while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenario, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain.
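To make the "4-bit weights, 8-bit activations" setting in the abstract concrete, here is a minimal sketch of a quantized dot product. It assumes plain symmetric per-tensor quantization, which is a generic textbook scheme and not necessarily the one Agile-Quant uses; the function names `quantize_symmetric` and `quantized_dot` and the sample values are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Symmetric per-tensor quantization: returns (integer codes, scale).

    Assumption: one scale for the whole tensor, no zero point.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantized_dot(weights, activations, w_bits=4, a_bits=8):
    """Dot product accumulated in integers, rescaled to float at the end."""
    w_q, w_scale = quantize_symmetric(weights, w_bits)
    a_q, a_scale = quantize_symmetric(activations, a_bits)
    acc = sum(w * a for w, a in zip(w_q, a_q))      # integer-only arithmetic
    return acc * w_scale * a_scale


# Made-up sample data, for illustration only.
weights = [0.7, -0.35, 0.1, 0.0]
activations = [1.27, -0.5, 0.25, 2.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"float result: {exact:.4f}, W4A8 result: {approx:.4f}")
```

The point of quantizing activations as well as weights is that the inner loop then runs entirely on small integers, which is what lets integer SIMD units do the work; with FP16 activations the weights would have to be converted back to floating point before each multiply.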

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper "Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge" addresses the compute and memory limits that make it hard to deploy large language models (LLMs) on edge devices. It proposes Agile-Quant, a framework that uses activation-guided quantization to quantize both the weights and the activations of LLMs, improving inference speed on edge devices while maintaining task performance.

### Main Issues and Challenges

1. **High Computational and Storage Costs**:
   - Large language models (such as GPT-3) have an enormous number of parameters. For example, GPT-3-175B requires 326GB of memory in float16 format, which far exceeds the capacity of most single GPUs, let alone resource-constrained edge devices.
   - Existing quantization methods mainly quantize the weights (for example, to 4 bits) while keeping activations in floating point (FP16). This limits the speedup on common edge devices, which typically support only 16x16 and 8x8 integer multipliers.

2. **Impact of Activation Quantization**:
   - Directly quantizing activations can degrade task performance, especially for larger models, because activations contain significant outliers.
   - Experiments show that simply setting these outliers to zero can cause a 45% drop in task performance.

3. **Hardware Support Limitations**:
   - Mainstream edge processors (such as CPUs and Raspberry Pi) use SIMD units to perform parallel operations, but these units typically support only 8-bit or wider operations.
   - Existing low-precision linear algebra kernels (such as GEMMLOWP and QNNPACK) perform well at 8-bit quantization but yield no additional speedup when precision is reduced below 8 bits, because mainstream CPUs support only 8-bit and wider SIMD operations.

### Solutions

1. **Activation-Guided Quantization Strategy**:
   - Based on hardware latency profiling and activation analysis, the authors design a basic activation quantization strategy that balances task performance against real inference speed.
   - They introduce activation-aware token pruning to reduce outliers and their adverse impact on the attention mechanism, which improves quantization quality.

2. **Hardware Optimization**:
   - A SIMD-based 4-bit multiplier supports efficient 4x4 INT4 multiplication.
   - An efficient TRIP matrix multiplication is proposed to further mitigate the adverse effects of outliers.

3. **Experimental Validation**:
   - Experiments on LLMs of different scales (LLaMA, OPT, and BLOOM) verify the effectiveness and efficiency of the Agile-Quant framework.
   - Agile-Quant achieves up to a 2.55x inference speedup over FP16 models on multiple edge devices in the 8-bit and 4-bit quantization scenarios, while maintaining task performance comparable to existing weight-only quantization methods.

### Summary

The paper addresses the compute and memory limits of deploying large language models on edge devices by proposing the Agile-Quant framework. By combining activation-guided quantization with hardware-level optimization, it improves inference speed on edge devices while maintaining task performance, which supports wider use of large language models on edge devices.
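The claim that activation outliers hurt quantization can be illustrated with a toy example. The sketch below assumes symmetric per-tensor 4-bit quantization and uses made-up activation values; it is not the paper's experiment, and the helper names `quantize_dequantize` and `mean_abs_error` are hypothetical. It shows how a single large value stretches the quantization scale so that the ordinary values lose almost all of their resolution.

```python
def quantize_dequantize(values, bits=4):
    """Round-trip values through symmetric per-tensor quantization."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # scale set by largest value
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


# Made-up activations: six ordinary values, then the same six plus one outlier.
typical = [0.12, -0.30, 0.25, -0.08, 0.31, -0.22]
with_outlier = typical + [12.0]

# Error on the ordinary values when they are quantized on their own.
err_typical = mean_abs_error(typical, quantize_dequantize(typical))

# Error on the same values when the outlier dictates the scale.
restored = quantize_dequantize(with_outlier)
err_same_values = mean_abs_error(typical, restored[:len(typical)])

print(f"error without outlier: {err_typical:.4f}")
print(f"error with outlier:    {err_same_values:.4f}")
```

In this toy case the outlier pushes the step size above every ordinary value, so all six round to zero. That is the kind of distortion that reducing outliers before quantization, as the token pruning step is described as doing, aims to avoid.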

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
