Abstract: Large Language Models (LLMs) have become popular and are widely used in creative ways because of their powerful capabilities. However, their substantial size and complexity prevent LLMs from being deployed efficiently on resource-constrained computing devices. Recent works that use quantization to boost the on-device efficiency of LLMs show that 8-bit or lower weight quantization is feasible with minimal impact on task performance when activations are kept in full precision. Nevertheless, weight-only quantization does not exploit the potential acceleration on general edge processors, which typically support 16x16 and 8x8 integer multipliers. Additionally, it cannot benefit from specialized devices like Field-Programmable Gate Arrays (FPGAs), which offer reconfigurable computing for designing custom multipliers of arbitrary bit width. In this paper, we propose HotaQ, a Hardware-oriented token adaptive Quantization framework for general LLMs, and implement end-to-end accelerators on multiple devices. Based on hardware profiling and activation analysis, we first introduce a basic activation quantization strategy that balances task performance against real inference speed. We then integrate an activation-aware token pruning workflow to reduce outliers and their adverse impact on attentivity. For mobile devices, we design a SIMD-based 4-bit multiplier and the efficient TRIP matrix multiplication. For FPGAs, we use DSP packing techniques for 4/8-bit systolic-array-based multipliers and introduce an analytical model for achieving the best performance with limited on-chip resources. We apply our framework to LLMs of different scales, including LLaMA, OPT, and BLOOM, with 4/8-bit activations and 4-bit weights.
Experiments show that HotaQ quantizes both model weights and activations while maintaining task performance comparable to existing weight-only quantization methods or even FP16 models. In the 4/8-bit scenario, HotaQ achieves on-device speedups of up to 2.5× and 5.2× over its FP16 counterparts on edge devices and FPGAs, respectively, marking a pioneering advancement in this domain.
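
To make the weight-and-activation setting concrete, the following is a minimal sketch of a generic quantized matrix multiplication with 8-bit activations and 4-bit weights. It is not HotaQ's algorithm: the symmetric uniform quantizer, the per-token activation scales, the per-output-channel weight scales, and the names `quantize_symmetric` and `quantized_matmul` are all assumptions made for illustration. Token pruning, outlier handling, SIMD kernels, and DSP packing are not modeled.

```python
def quantize_symmetric(values, bits):
    """Symmetric uniform quantization of one row (illustrative assumption,
    not the scheme from the abstract). Returns (integer codes, scale)."""
    qmax = (1 << (bits - 1)) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantized_matmul(activations, weights, act_bits=8, weight_bits=4):
    """Approximate activations @ weights^T with integer multiply-accumulate.

    activations: list of token rows, each of length k
    weights:     list of output-channel rows, each of length k
    """
    # One scale per token (activation row) and one per output channel (weight row).
    act_q = [quantize_symmetric(row, act_bits) for row in activations]
    w_q = [quantize_symmetric(row, weight_bits) for row in weights]

    out = []
    for a_codes, a_scale in act_q:
        row = []
        for w_codes, w_scale in w_q:
            # The inner product uses integers only; this is the part a
            # low-bit integer multiplier could accelerate.
            acc = sum(a * w for a, w in zip(a_codes, w_codes))
            # A single floating-point rescale per output element.
            row.append(acc * a_scale * w_scale)
        out.append(row)
    return out


if __name__ == "__main__":
    x = [[1.0, -2.0, 0.5]]                  # one token, three features
    w = [[0.5, 0.25, -1.0]]                 # one output channel
    print(quantized_matmul(x, w))           # close to the exact value -0.5
```

Because both operands are integers inside the inner loop, the multiply-accumulate can map onto narrow integer multipliers, which is the opportunity that weight-only quantization leaves unused.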

Bit-Offsetter: A Bit-serial DNN Accelerator with Weight-offset MAC for Bit-wise Sparsity Exploitation

Bit-balance: Model-Hardware Co-design for Accelerating NNs by Exploiting Bit-level Sparsity

BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration

BEM: Bit-level Sparsity-aware Deep Learning Accelerator with Efficient Booth Encoding and Weight Multiplexing

BitSNNs: Revisiting Energy-efficient Spiking Neural Networks

BitCluster: Fine-Grained Weight Quantization for Load-Balanced Bit-Serial Neural Network Accelerators

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

Bit-serial Weight Pools: Compression and Arbitrary Precision Execution of Neural Networks on Resource Constrained Processors

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

A Low-Power Sparse Convolutional Neural Network Accelerator with Pre-Encoding Radix-4 Booth Multiplier

EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration

Bit Error Robustness for Energy-Efficient DNN Accelerators

QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

BitXpro: Regularity-Aware Hardware Runtime Pruning for Deep Neural Networks

Sparsity-Aware Optimization of In-Memory Bayesian Binary Neural Network Accelerators

Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture

BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core

Leveraging Bit-Serial Architectures for Hardware-Oriented Deep Learning Accelerators with Column-Buffering Dataflow

FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs--Down to 2 Bits!

A High Performance Multi-Bit-Width Booth Vector Systolic Accelerator for NAS Optimized Deep Learning Neural Networks