Abstract: Large Language Models (LLMs) have become popular and are widely used in creative ways because of their powerful capabilities. However, their substantial size and complexity prevent LLMs from being deployed efficiently on resource-constrained computing devices. Recent works that use quantization to boost the on-device efficiency of LLMs show that, with full-precision activations, 8-bit or lower weight quantization is feasible with minimal impact on task performance. Nevertheless, weight-only quantization does not exploit the potential acceleration on general edge processors, which typically support 16x16 and 8x8 integer multipliers. Nor can it benefit from specialized devices such as Field-Programmable Gate Arrays (FPGAs), which offer reconfigurable computing for designing custom multipliers of arbitrary bit width. In this paper, we propose HotaQ, a Hardware-oriented token adaptive Quantization framework for general LLMs, and implement end-to-end accelerators on multiple devices. Based on hardware profiling and activation analysis, we first introduce a basic activation quantization strategy that balances task performance against real inference speed. We then integrate an activation-aware token pruning workflow to reduce outliers and their adverse impact on attentivity. For mobile devices, we design a SIMD-based 4-bit multiplier and the efficient TRIP matrix multiplication. For FPGAs, we use DSP packing techniques for 4/8-bit systolic-array-based multipliers and introduce an analytical model for achieving the best performance with limited on-chip resources. We apply our framework to LLMs of different scales, including LLaMA, OPT, and BLOOM, with 4/8-bit activations and 4-bit weights.
Experiments show that HotaQ quantizes both model weights and activations while maintaining task performance comparable to existing weight-only quantization methods or even FP16 models. In the 4/8-bit scenario, HotaQ achieves an on-device speedup of up to 2.5× on edge devices and up to 5.2× on FPGAs compared to its FP16 counterparts, marking a pioneering advancement in this domain.
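To make the weight-and-activation setting concrete, the following is a minimal sketch of a matrix multiplication with 8-bit activations and 4-bit weights using plain symmetric uniform quantization. It is not HotaQ's actual strategy: the per-token activation scales, the per-output-channel weight scales, and the function names `quantize_symmetric` and `quantized_matmul` are all assumptions made for illustration, and the token pruning, TRIP, and DSP packing steps named in the abstract are not modeled.

```python
import numpy as np

def quantize_symmetric(x, bits, axis):
    """Symmetric uniform quantization with one scale per slice along `axis`.

    Hypothetical helper for illustration; returns integer codes and the
    floating-point scales needed to dequantize them.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)  # avoid zero scales
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def quantized_matmul(x, w, act_bits=8, weight_bits=4):
    """Approximate x @ w using integer codes for both operands.

    x: (tokens, in_features) activations, quantized per token (assumption).
    w: (in_features, out_features) weights, quantized per output channel
       (assumption).
    """
    xq, x_scale = quantize_symmetric(x, act_bits, axis=1)     # (tokens, 1)
    wq, w_scale = quantize_symmetric(w, weight_bits, axis=0)  # (1, out)
    acc = xq @ wq                    # integer accumulation, as on an int MAC
    return acc.astype(np.float64) * x_scale * w_scale         # dequantize

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 64))
    w = rng.standard_normal((64, 16))
    y_ref = x @ w
    y_q = quantized_matmul(x, w)
    rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
    print(f"relative error: {rel_err:.4f}")
```

Because both operands are integer codes, the inner product maps onto the integer multipliers that the abstract says weight-only quantization leaves unused; the floating-point work is confined to one rescale per output element.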

Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language Models

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Effective Shifting and Scaling

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques

Taming Sensitive Weights : Noise Perturbation Fine-tuning for Robust LLM Quantization

OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization

Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation

MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization

OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization

OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models

A Study of Optimizations for Fine-tuning Large Language Models

Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference