Abstract: Large Language Models (LLMs) have become popular and are widely used in creative ways because of their powerful capabilities. However, their substantial model size and complexity prevent LLMs from being deployed efficiently on resource-constrained computing devices. Recent works that use quantization to boost the on-device efficiency of LLMs show that 8-bit or lower weight quantization is feasible with minimal impact on task performance when activations are kept in full precision. Nevertheless, weight-only quantization does not exploit the potential acceleration on general edge processors, which typically support 16x16 and 8x8 integer multipliers. Nor can it benefit from specialized devices such as Field-Programmable Gate Arrays (FPGAs), whose reconfigurable computing allows custom multipliers of arbitrary bit width. In this paper, we propose HotaQ, a Hardware-oriented token adaptive Quantization framework for general LLMs, and implement end-to-end accelerators on multiple devices. Based on hardware profiling and activation analysis, we first introduce a basic activation quantization strategy that balances task performance against real inference speed. We then integrate an activation-aware token pruning workflow to reduce outliers and their adverse impact on attentivity. For mobile devices, we design a SIMD-based 4-bit multiplier and the efficient TRIP matrix multiplication. For FPGAs, we use DSP packing techniques for 4/8-bit systolic-array-based multipliers and introduce an analytical model for achieving the best performance with limited on-chip resources. We apply our framework to LLMs of different scales, including LLaMA, OPT, and BLOOM, with 4/8-bit activations and 4-bit weights. Experiments show that HotaQ quantizes both model weights and activations while maintaining task performance comparable to existing weight-only quantization methods or even FP16 models. In the 4/8-bit scenario, HotaQ achieves on-device speedups of up to 2.5× on edge devices and 5.2× on FPGAs compared to its FP16 counterparts, marking a pioneering advancement in this domain.
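To illustrate why quantizing activations as well as weights lets the inner loop run on integer multipliers, the following is a minimal sketch and not HotaQ's actual algorithm. It assumes plain symmetric quantization with one scale per token (activation vector) and one scale per output channel (weight row); the function names `quantize_symmetric` and `int_matvec` are hypothetical, and token pruning, SIMD packing, and DSP packing are not modeled.

```python
def quantize_symmetric(values, bits):
    """Map a list of floats to signed `bits`-bit integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_matvec(activation, weight_rows, act_bits=8, weight_bits=4):
    """Approximate y = W x using integer multiply-accumulates only."""
    # Assumption: one scale per token (the whole activation vector).
    qx, sx = quantize_symmetric(activation, act_bits)
    out = []
    for row in weight_rows:
        # Assumption: one scale per output channel (weight row).
        qw, sw = quantize_symmetric(row, weight_bits)
        acc = sum(w * x for w, x in zip(qw, qx))  # integer-only inner loop
        out.append(acc * sw * sx)  # a single float rescale per output
    return out


if __name__ == "__main__":
    x = [0.5, -1.0, 0.25, 2.0]
    w = [[0.1, -0.2, 0.3, 0.4], [-0.7, 0.0, 0.35, 0.7]]
    print(int_matvec(x, w))  # close to the float result [1.125, 1.1375]
```

With weight-only quantization, `qx` would stay in floating point and every product in the inner loop would be a floating-point multiply. Quantizing both operands confines floating-point work to one rescale per output element.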
