UN-η: an Offline Adaptive Normalization Method for Deploying Transformers
Zheyang Li, Kai Zhang, Qiming Yang, Chaoxiang Lan, Huanlong Zhang, Wenming Tan, Jun Xiao, Shiliang Pu
DOI: https://doi.org/10.1016/j.knosys.2024.112141
IF: 8.139
2024-01-01
Knowledge-Based Systems
Abstract: The Transformer has become the de facto architecture for many natural language and vision tasks thanks to its remarkable performance. As an essential part of transformers, normalization functions serve to improve the robustness and performance of the architecture. In this work, we build a unified framework to describe and compare the advantages and disadvantages of different normalization methods in transformers. Layer Normalization (LN), an online normalization method, was originally used to normalize activations within each token in order to handle variable-length inputs and improve robustness. However, the strong robustness and performance of online normalization methods come at the cost of inefficient inference. In contrast, transformers built with hardware-efficient offline normalization schemes such as Batch Normalization yield subpar performance and sometimes even collapse during training. Furthermore, we find that abnormal behaviors in activation statistics, such as significant iteration-to-iteration fluctuations and extreme outliers across layers, strongly affect the training process and the final result. Based on the unified framework, we propose Unified Normalization with eta (UN-η), which achieves low-cost inference and good performance. We theoretically prove the effectiveness of our method. Experimental results on language, speech, and vision tasks demonstrate that UN-η can be a hardware-efficient substitute for LN. In addition, we assess the hardware efficiency of our method on GPU.
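The contrast the abstract draws between online and offline normalization can be illustrated with a minimal NumPy sketch. This is not the UN-η method itself, whose details the abstract does not give. It only shows, under generic assumptions, why an offline scheme is cheaper at inference: its statistics are constants, so the whole normalization can be folded into the following linear layer. The function names (`layer_norm`, `offline_norm`, `fold_into_linear`) and the weight layout `(d_in, d_out)` are hypothetical choices made for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Online: mean and variance are recomputed from every token at inference time.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def offline_norm(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # Offline: statistics are per-channel constants fixed after training
    # (as in Batch Normalization at inference), so no reduction is needed.
    return (x - running_mean) / np.sqrt(running_var + eps) * gamma + beta

def fold_into_linear(W, b, running_mean, running_var, gamma, beta, eps=1e-5):
    # Because offline_norm is an affine map x * scale + shift, it can be merged
    # into a following linear layer y = z @ W + b, with W of shape (d_in, d_out).
    scale = gamma / np.sqrt(running_var + eps)
    shift = beta - running_mean * scale
    W_folded = scale[:, None] * W
    b_folded = shift @ W + b
    return W_folded, b_folded

# Illustrative data: 4 tokens with 8 channels, projected to 3 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma, beta = rng.normal(size=8), rng.normal(size=8)
running_mean, running_var = rng.normal(size=8), rng.uniform(0.5, 2.0, size=8)
W, b = rng.normal(size=(8, 3)), rng.normal(size=3)

W_folded, b_folded = fold_into_linear(W, b, running_mean, running_var, gamma, beta)
unfolded = offline_norm(x, running_mean, running_var, gamma, beta) @ W + b
folded = x @ W_folded + b_folded  # same result, with no normalization op at inference
```

Layer Normalization cannot be folded this way, because `mu` and `var` depend on the input token. That per-token reduction is the inference cost the abstract attributes to online methods.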