Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers

Pingcheng Dong, Yonghao Tan, Dong Zhang, Tianwei Ni, Xuejiao Liu, Yu Liu, Peng Luo, Luhong Liang, Shih-Yang Liu, Xijie Huang, Huaiyu Zhu, Yun Pan, Fengwei An, Kwang-Ting Cheng

2024-03-29

Abstract: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require hardware-unfriendly high-precision arithmetic such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposes a genetic LUT-Approximation algorithm, namely GQA-LUT, that can automatically determine the parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. In addition, the proposed GQA-LUT enables the use of INT8-based LUT-Approximation, which achieves an area saving of 81.3~81.7% and a power reduction of 79.3~80.2% compared to the high-precision FP/INT 32 alternatives. Code is available at https://

Machine Learning, Hardware Architecture, Neural and Evolutionary Computing

What problem does this paper attempt to address?

The main problem this paper addresses is the high hardware cost and insufficient optimization of non-linear operations in Transformer models. Specifically:

1. **Requirement for high-precision operations**: The non-linear functions (such as GELU, Softmax, and LayerNorm) in existing Transformers and their lightweight variants usually require high-precision operations (for example, 32-bit floating-point or integer), which leads to significant hardware overhead.
2. **Insufficient quantization awareness**: Although previous studies have used piece-wise linear (PWL) approximation and look-up tables (LUT) to store parameters, most of these methods rely on hardware-unfriendly high-precision arithmetic (such as FP/INT 32) and do not consider integer-only INT quantization.
3. **Waste of hardware resources**: Directly applying a high-precision LUT approximation to low-precision inputs (such as INT8) wastes resources, because INT8 can represent far fewer values than FP/INT32.

To solve these problems, the paper proposes a Genetic Quantization-Aware Approximation algorithm (GQA-LUT). Its main contributions are as follows:

- **Quantization-aware LUT approximation calculation process**: It analyzes the relationship between the scaling factor and the LUT parameters and proposes a general quantization-aware LUT approximation calculation process.
- **Genetic algorithm for automatically determining breakpoints**: It proposes a genetic algorithm, GQA-LUT, to automatically determine the LUT approximation breakpoints of non-linear functions, overcoming the limitation that existing approximation algorithms cannot adjust their parameters according to the scaling factor.
- **Rounding mutation algorithm**: It introduces a Rounding Mutation (RM) algorithm, which incorporates rounding error into GQA-LUT and addresses the breakpoint deviation problem that arises with infeasible scales.

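To make the PWL-plus-LUT idea concrete, the following is a minimal sketch, not the paper's implementation. It approximates GELU with chords between hand-picked breakpoints, then folds an assumed INT8 input scale into the table so that segment selection and the multiply-add run on integers. The function names, the breakpoint values, the 5 fractional bits for slopes, and the way the scale is folded in (breakpoints and intercepts divided by the scale, slopes stored in fixed point) are all illustrative assumptions.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU via the error function (reference for the approximation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(func, breakpoints):
    """One (slope, intercept) pair per segment, taken from the chord
    between adjacent breakpoints."""
    table = []
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        k = (func(hi) - func(lo)) / (hi - lo)
        b = func(lo) - k * lo
        table.append((k, b))
    return table


def select_segment(x, breakpoints):
    """Index of the segment containing x; inputs outside the range
    fall into the first or last segment."""
    idx = 0
    for i, p in enumerate(breakpoints[1:-1], start=1):  # interior breakpoints
        if x >= p:
            idx = i
    return idx


def pwl_eval(x, breakpoints, table):
    """Floating-point PWL evaluation: k * x + b on the selected segment."""
    k, b = table[select_segment(x, breakpoints)]
    return k * x + b


def quantize_lut(breakpoints, table, scale, frac_bits=5):
    """Assumed folding of the input scale into the LUT: integer breakpoints
    (p / scale), fixed-point slopes, and integer intercepts (b / scale)."""
    q_breaks = [max(-128, min(127, round(p / scale))) for p in breakpoints]
    q_table = [(round(k * 2 ** frac_bits), round(b / scale)) for k, b in table]
    return q_breaks, q_table


def int_pwl_eval(q, q_breaks, q_table, scale, frac_bits=5):
    """Integer comparison and multiply-add on the INT8 code q, followed by
    a single rescale back to the real domain."""
    k_int, b_int = q_table[select_segment(q, q_breaks)]
    acc = k_int * q + (b_int << frac_bits)  # integer-only accumulation
    return acc * scale / 2 ** frac_bits
```

With breakpoints such as `[-4, -2, -1, 0, 1, 2, 4]` and a scale of `4 / 127`, `int_pwl_eval(q, ...)` can be compared against `gelu(q * scale)` for every INT8 code `q`. The rounding of breakpoints, slopes, and intercepts depends on the scale, which is why breakpoints chosen without regard to the scale can end up poorly placed.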
The experimental results show that GQA-LUT achieves almost negligible performance degradation on semantic segmentation tasks, and that it reduces area by 81.3% to 81.7% and power consumption by 79.3% to 80.2% compared with the high-precision FP/INT 32 scheme. In conclusion, the paper aims to enable efficient and compact hardware design by proposing GQA-LUT and its refinement, RM, so that non-linear operations in Transformer models can be computed with low-bit integer arithmetic.
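As a rough illustration of searching for breakpoints with a genetic algorithm and a rounding-style mutation, here is a simplified sketch. It is not the authors' GQA-LUT: the population size, selection rule, one-point crossover, Gaussian perturbation, mutation probabilities, mean-squared-error fitness, and the `snap` helper that moves a breakpoint onto the grid implied by an assumed scale are all hypothetical choices made for this example.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU via the error function (target of the approximation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def pwl_mse(func, inner, lo=-4.0, hi=4.0, samples=161):
    """Mean squared error of chord interpolation through lo, the inner
    breakpoints, and hi, measured on a uniform grid."""
    pts = [lo] + sorted(inner) + [hi]
    err = 0.0
    for n in range(samples):
        x = lo + (hi - lo) * n / (samples - 1)
        i = 0
        while i < len(pts) - 2 and x >= pts[i + 1]:
            i += 1
        x0, x1 = pts[i], pts[i + 1]
        if x1 == x0:  # degenerate segment after breakpoints collide
            y = func(x0)
        else:
            y = func(x0) + (func(x1) - func(x0)) * (x - x0) / (x1 - x0)
        err += (y - func(x)) ** 2
    return err / samples


def snap(p, scale, lo=-4.0, hi=4.0):
    """Rounding-style mutation: move p to the nearest multiple of the
    scale, i.e. a value an integer input can hit exactly."""
    return min(hi, max(lo, round(p / scale) * scale))


def genetic_breakpoints(func, n_inner=6, scale=4.0 / 127, pop_size=30,
                        generations=40, p_mut=0.3, p_round=0.2, seed=0):
    """Evolve inner breakpoints on [-4, 4]; returns (breakpoints, mse)."""
    rng = random.Random(seed)
    lo, hi = -4.0, 4.0

    def fitness(ind):
        return pwl_mse(func, ind, lo, hi)

    pop = [sorted(rng.uniform(lo, hi) for _ in range(n_inner))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_inner)
            child = a[:cut] + b[cut:]  # one-point crossover
            for i in range(n_inner):
                if rng.random() < p_mut:  # Gaussian perturbation
                    child[i] += rng.gauss(0.0, 0.2)
                if rng.random() < p_round:  # rounding-style mutation
                    child[i] = snap(child[i], scale, lo, hi)
                child[i] = min(hi, max(lo, child[i]))
            children.append(sorted(child))
        pop = elite + children
    # Rank candidates after snapping, since a deployed table would hold
    # grid-aligned breakpoints.
    snapped = [sorted(snap(p, scale, lo, hi) for p in ind) for ind in pop]
    best = min(snapped, key=fitness)
    return best, fitness(best)
```

Calling `genetic_breakpoints(gelu)` returns six grid-aligned inner breakpoints and their error. Exposing candidates to rounding during the search, rather than only at the end, is the intuition behind including rounding error in the optimization; the paper's actual operators and fitness function may differ.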
