Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers

Pingcheng Dong, Yonghao Tan, Dong Zhang, Tianwei Ni, Xuejiao Liu, Yu Liu, Peng Luo, Luhong Liang, Shih-Yang Liu, Xijie Huang, Huaiyu Zhu, Yun Pan, Fengwei An, Kwang-Ting Cheng
2024-03-29
Abstract: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require hardware-unfriendly high-precision arithmetic such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposes a genetic LUT-Approximation algorithm, namely GQA-LUT, that can automatically determine the parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. Besides, the proposed GQA-LUT enables the employment of INT8-based LUT-Approximation that achieves an area savings of 81.3~81.7% and a power reduction of 79.3~80.2% compared to the high-precision FP/INT 32 alternatives. Code is available at https://
Machine Learning, Hardware Architecture, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper addresses the high hardware cost of non-linear operations in Transformer models and the lack of quantization-aware methods for optimizing them. Specifically:

1. **Requirement for high-precision operations**: The non-linear functions in Transformers and their lightweight variants (such as GELU, Softmax, and LayerNorm) usually require high-precision arithmetic (for example, 32-bit floating point or integer), which leads to significant hardware overhead.
2. **Insufficient quantization awareness**: Previous studies use piece-wise linear (PWL) approximation and store the parameters in look-up tables (LUT), but most of these methods rely on hardware-unfriendly high-precision arithmetic (such as FP/INT 32) and do not consider integer-only INT quantization.
3. **Waste of hardware resources**: Directly applying a high-precision LUT approximation to low-precision inputs (such as INT8) wastes resources, because INT8 can represent far fewer values than FP/INT 32.

To solve these problems, the paper proposes a Genetic Quantization-Aware Approximation algorithm (GQA-LUT). Its main contributions are as follows:

- **Quantization-aware LUT approximation calculation process**: It analyzes the relationship between the scaling factor and the LUT parameters and proposes a general quantization-aware calculation process for LUT approximation.
- **Genetic algorithm for automatically determining breakpoints**: It proposes a genetic algorithm, GQA-LUT, that automatically determines the LUT approximation breakpoints of non-linear functions, overcoming the limitation of existing approximation algorithms that cannot adjust their parameters according to the scaling factor.
- **Rounding mutation algorithm**: It introduces a Rounding Mutation (RM) algorithm, which incorporates rounding error into GQA-LUT and solves the breakpoint deviation problem that arises with intractable scales.
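The idea behind the contributions above can be illustrated with a toy sketch. This is not the authors' GQA-LUT implementation: the target function (GELU), the input range [-4, 4], the number of breakpoints, the scale 2^-5, the endpoint-interpolation fit, the elitist genetic loop, and all function names (`build_lut`, `round_to_grid`, `genetic_search`, and so on) are assumptions made for illustration. Snapping mutated breakpoints to the quantization grid is only a loose stand-in for the Rounding Mutation idea, and slopes and intercepts are left in floating point for brevity.

```python
import math
import random

def gelu(x):
    # Exact GELU via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_lut(breakpoints, lo=-4.0, hi=4.0):
    """Return segment edges and one (slope, intercept) pair per segment."""
    # set() removes duplicate edges, so no segment has zero width.
    edges = sorted(set([lo, hi] + list(breakpoints)))
    params = []
    for a, b in zip(edges[:-1], edges[1:]):
        k = (gelu(b) - gelu(a)) / (b - a)      # line through both endpoints
        params.append((k, gelu(a) - k * a))
    return edges, params

def pwl_eval(x, edges, params):
    # Use the first segment whose right edge exceeds x; fall back to the last one.
    for i, (k, b) in enumerate(params):
        if x < edges[i + 1] or i == len(params) - 1:
            return k * x + b

def mse_on_int8_grid(breakpoints, scale):
    # Fitness: error measured only at inputs an INT8 value can represent, x = q * scale.
    edges, params = build_lut(breakpoints)
    xs = [q * scale for q in range(-128, 128)]
    return sum((pwl_eval(x, edges, params) - gelu(x)) ** 2 for x in xs) / len(xs)

def round_to_grid(bp, scale):
    # Assumption: snap a breakpoint to the nearest INT8-representable value.
    q = max(-128, min(127, round(bp / scale)))
    return q * scale

def genetic_search(scale, n_breakpoints=7, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [[round_to_grid(rng.uniform(-4.0, 4.0), scale) for _ in range(n_breakpoints)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda bps: mse_on_int8_grid(bps, scale))
        parents = pop[: pop_size // 2]           # elitism: keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]   # uniform crossover
            i = rng.randrange(n_breakpoints)
            # Mutate one breakpoint, then snap it back onto the grid.
            child[i] = round_to_grid(child[i] + rng.gauss(0.0, 0.5), scale)
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda bps: mse_on_int8_grid(bps, scale))

if __name__ == "__main__":
    scale = 2 ** -5                              # hypothetical input scaling factor
    best = genetic_search(scale)
    print(sorted(best), mse_on_int8_grid(best, scale))
```

Because the fitness is evaluated only at `q * scale` and every breakpoint is kept on that grid, the search result depends on the scaling factor, which is the property the summary attributes to a quantization-aware method.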
The experimental results show that GQA-LUT achieves negligible degradation on the semantic segmentation task for both vanilla and linear Transformer models. The INT8-based LUT approximation it enables saves 81.3~81.7% in area and 79.3~80.2% in power compared with the high-precision FP/INT 32 alternatives. In conclusion, the paper aims at an efficient and compact hardware design: GQA-LUT and its Rounding Mutation technique allow the non-linear operations in Transformer models to be computed with low-bit integer arithmetic.