Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models

Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
2024-11-06
Abstract: Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the $\textbf{optimal approximation rate}$, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the choice of activation function in large language models (LLMs). Specifically, it proposes a new class of activation functions, polynomial composition activations (PolyCom), to optimize the dynamics of the Transformer model. Traditional activation functions such as ReLU and its variants are computationally efficient and easy to implement, but they are limited in modeling complex higher-order relationships. These limitations are particularly evident in the Transformer architecture, which must capture subtle and complex dependencies in the data.

### Main contributions

1. **A new activation function, PolyCom**:
   - Introduces two specific forms of PolyCom, PolyReLU and PolyNorm, and explains in detail how they are integrated into the Transformer architecture.
2. **Theoretical analysis**:
   - Derives bounds on the number of trainable parameters that PolyReLU networks require and establishes their optimal approximation rate in Sobolev spaces.
   - Proves that PolyReLU networks can exactly represent ReLU networks without increasing the model size.
   - Provides upper and lower bounds for approximating PolyReLU networks with ReLU networks, indicating that PolyReLU networks are more efficient in representational power.
3. **Experimental verification**:
   - Conducts extensive experiments on a 1-billion-parameter dense model and on a Mixture-of-Experts (MoE) model with 1 billion active parameters and 7 billion total parameters.
   - The results show that models using PolyCom outperform models using other activation functions such as SwiGLU, GELU, and ReLU in terms of training loss, validation perplexity, and downstream task performance.

### Abstract

The Transformer model is widely used in many fields because of its strong fitting ability, which is partly attributed to its inherent nonlinearity. Besides the ReLU function used in the original Transformer architecture, researchers have also explored other modules such as GeLU and SwishGLU to enhance nonlinearity and improve representational capacity. This paper proposes a new class of polynomial composition activations (PolyCom) that aims to optimize the dynamics of the Transformer. Theoretically, the authors conduct a comprehensive mathematical analysis of PolyCom, demonstrating its enhanced expressive power and effectiveness compared to other activation functions. Experimental results show that large language models using PolyCom can capture higher-order interactions in the data, thus achieving significant improvements in accuracy and convergence speed.

### Key technical details

- **Polynomial composition activations (PolyCom)**:
  - **PolyReLU**: Defined as \( \text{PolyReLU}(x)=\sum_{i = 0}^{r}a_i\text{ReLU}^i(x) \), where \( \text{ReLU}^i(x)=\max\{x, 0\}^i \).
  - **PolyNorm**: Defined as \( \text{PolyNorm}(x)=\sum_{i = 0}^{r}a_i\frac{x^i}{\|x^i\|_2} \), where \( x^i \) denotes the element-wise \( i \)-th power of the input tensor \( x \), and \( \| \cdot \|_2 \) denotes the L2 norm.
- **Theoretical analysis**:
  - **Approximating ReLU networks**: Proves that PolyReLU networks can exactly represent ReLU networks with a comparable number of parameters.
  - **Approximating PolyReLU networks**: Provides upper and lower bounds for approximating PolyReLU networks with ReLU networks, indicating that PolyReLU networks are more efficient in representational power.
  - **Optimal approximation rate in Sobolev space**: Proves that PolyReLU networks achieve the optimal approximation rate in Sobolev spaces, that is, they need the fewest parameters for a given error tolerance.
- **Experimental verification**:
  - **Dense model**: On the 1-billion-parameter dense model, models using PolyCom outperform models using other activation functions in terms of training loss, validation perplexity, and downstream task performance.
  - **Mixture-of-Experts (MoE)**
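
The PolyReLU and PolyNorm definitions under "Key technical details" can be sketched in a few lines of plain Python. This is a minimal illustration of the two formulas only, not the authors' implementation. The function names `poly_relu` and `poly_norm`, the `eps` stabilizer, and the choice to take the L2 norm over the whole input vector are assumptions made for this sketch. The coefficients \( a_i \) are passed in as plain numbers here.

```python
import math
from typing import List, Sequence


def poly_relu(x: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Element-wise PolyReLU(x) = sum_i a_i * ReLU(x)^i, with ReLU(x)^0 = 1."""
    out = []
    for v in x:
        r = max(v, 0.0)  # ReLU
        out.append(sum(a * r ** i for i, a in enumerate(coeffs)))
    return out


def poly_norm(x: Sequence[float], coeffs: Sequence[float],
              eps: float = 1e-6) -> List[float]:
    """PolyNorm(x) = sum_i a_i * x^i / ||x^i||_2 with element-wise powers.

    Assumption: the L2 norm is taken over the whole vector `x`, and `eps`
    (hypothetical, not in the formula) guards against division by zero.
    """
    out = [0.0] * len(x)
    for i, a in enumerate(coeffs):
        powered = [v ** i for v in x]                   # element-wise x^i
        norm = math.sqrt(sum(p * p for p in powered))   # ||x^i||_2
        for j, p in enumerate(powered):
            out[j] += a * p / (norm + eps)
    return out


if __name__ == "__main__":
    # Coefficients (a_0, a_1) = (0, 1) reduce PolyReLU to plain ReLU.
    print(poly_relu([-1.0, 2.0], [0.0, 1.0]))
    print(poly_norm([3.0, 4.0], [0.0, 1.0], eps=0.0))
```

Setting \( a_1 = 1 \) and every other coefficient to zero recovers ordinary ReLU, which is a one-unit instance of the claim that a PolyReLU network can represent a ReLU network without extra parameters.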