Abstract: Large Language Models (LLMs) have become popular and are widely used in creative ways because of their powerful capabilities. However, their substantial size and complexity prevent LLMs from being deployed efficiently on resource-constrained computing devices. Recent works that use quantization to boost the on-device efficiency of LLMs show that 8-bit or lower weight quantization is feasible with minimal impact on task performance when activations are kept in full precision. Nevertheless, weight-only quantization does not exploit the potential acceleration on general edge processors, which typically support 16×16 and 8×8 integer multipliers. Additionally, it cannot benefit from specialized devices like Field-Programmable Gate Arrays (FPGAs), which offer reconfigurable computing for designing custom multipliers of arbitrary bit width. In this paper, we propose HotaQ, a Hardware-oriented token adaptive Quantization framework for general LLMs, and implement end-to-end accelerators on multiple devices. Based on hardware profiling and activation analysis, we first introduce a basic activation quantization strategy to balance the trade-off between task performance and real inference speed. We then integrate an activation-aware token pruning workflow to reduce outliers and the adverse impact on attentivity. For mobile devices, we design a SIMD-based 4-bit multiplier and the efficient TRIP matrix multiplication. For FPGAs, we use DSP packing techniques for 4/8-bit systolic-array-based multipliers and introduce an analytical model for achieving the best performance with limited on-chip resources. We apply our framework to LLMs of different scales, including LLaMA, OPT, and BLOOM, with 4/8-bit activations and 4-bit weights.
Experiments show that HotaQ quantizes both model weights and activations while maintaining task performance comparable to existing weight-only quantization methods or even FP16 models. In the 4/8-bit scenario, HotaQ achieves on-device speedups of up to 2.5× on edge devices and up to 5.2× on FPGAs over its FP16 counterparts, marking a pioneering advancement in this domain.
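
To illustrate why quantizing activations as well as weights lets the inner product run on integer multipliers, the following is a minimal sketch and not HotaQ's actual scheme. It assumes symmetric quantization with one scale per token for 8-bit activations and one scale per output channel for 4-bit weights; the function names `quantize_symmetric` and `quantized_matmul` are hypothetical and chosen only for this example.

```python
def quantize_symmetric(values, bits):
    """Quantize floats to signed integers in [-qmax, qmax], qmax = 2**(bits-1) - 1.

    Returns the integer list and the scale that maps integers back to floats.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantized_matmul(activations, weights, act_bits=8, weight_bits=4):
    """Approximate activations @ weights^T using integer multiply-accumulate.

    activations: list of tokens, each a list of in_features floats.
    weights: list of output channels, each a list of in_features floats.
    """
    # Assumption: one scale per output channel (one row of the weight matrix).
    quantized_weights = [quantize_symmetric(row, weight_bits) for row in weights]

    output = []
    for token in activations:
        # Assumption: one scale per token, so a token with large outliers
        # does not reduce the resolution available to the other tokens.
        a_q, a_scale = quantize_symmetric(token, act_bits)
        row = []
        for w_q, w_scale in quantized_weights:
            # The accumulation uses integers only; this is the part that
            # maps onto 8x8 or narrower integer multipliers.
            acc = sum(a * w for a, w in zip(a_q, w_q))
            # The two scales are applied once per output element.
            row.append(acc * a_scale * w_scale)
        output.append(row)
    return output
```

With weight-only quantization, the weights would have to be dequantized before the multiply, so the accumulation would stay in floating point; here the floating-point work is limited to one rescale per output element.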

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models
