Abstract: Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the $\textbf{optimal approximation rate}$, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the choice of activation function in large language models (LLMs). Specifically, it proposes a new class of activation functions, polynomial composition activations (PolyCom), to optimize the dynamics of the Transformer model. Traditional activation functions such as ReLU and its variants are computationally efficient and easy to implement, but they are limited in modeling complex higher-order relationships. These limitations are particularly evident in the Transformer architecture, which needs to capture subtle and complex dependencies in the data.

### Main contributions

1. **A new activation function, PolyCom**:
   - Introduces two specific forms of PolyCom, PolyReLU and PolyNorm, and explains in detail how they are integrated into the Transformer architecture.
2. **Theoretical analysis**:
   - Derives bounds on the number of trainable parameters required by PolyReLU networks and their optimal approximation rate in Sobolev spaces.
   - Proves that a PolyReLU network can exactly represent a ReLU network without increasing the model size.
   - Provides upper and lower bounds for approximating a PolyReLU network with a ReLU network, indicating that PolyReLU networks are more efficient in representational power.
3. **Experimental verification**:
   - Conducts extensive experiments on a 1-billion-parameter dense model and a Mixture-of-Experts (MoE) model with 1 billion active parameters and 7 billion total parameters.
   - The results show that models using PolyCom outperform those using other activation functions such as SwiGLU, GELU, and ReLU in terms of training loss, validation perplexity, and downstream task performance.

### Abstract

The Transformer model is widely used in multiple fields due to its strong fitting ability, which is partly attributed to its inherent nonlinearity. Besides the ReLU function used in the original Transformer architecture, researchers have also explored other modules such as GeLU and SwishGLU to enhance nonlinearity and improve representational capacity. This paper proposes a new class of polynomial composition activation functions (PolyCom) that aims to optimize the dynamics of the Transformer. Theoretically, we conduct a comprehensive mathematical analysis of PolyCom, demonstrating its enhanced expressive power and effectiveness compared to other activation functions. Experimental results show that large language models using PolyCom can capture higher-order interactions in the data, achieving significant improvements in accuracy and convergence speed.

### Key technical details

- **Polynomial composition activations (PolyCom)**:
  - **PolyReLU**: Defined as $ \text{PolyReLU}(x)=\sum_{i=0}^{r}a_i\,\text{ReLU}^i(x) $, where $ \text{ReLU}^i(x)=\max\{x, 0\}^i $ and the $ a_i $ are the polynomial coefficients.
  - **PolyNorm**: Defined as $ \text{PolyNorm}(x)=\sum_{i=0}^{r}a_i\frac{x^i}{\|x^i\|_2} $, where $ x^i $ denotes the element-wise $ i $-th power of the input tensor $ x $, and $ \|\cdot\|_2 $ denotes the L2 norm, so each power term is L2-normalized before being combined.
- **Theoretical analysis**:
  - **Approximating ReLU networks**: Proves that a PolyReLU network can exactly represent a ReLU network with a comparable number of parameters.
  - **Approximating PolyReLU networks**: Provides upper and lower bounds for approximating a PolyReLU network with a ReLU network, indicating that PolyReLU networks are more efficient in representational power.
  - **Optimal approximation rate in Sobolev spaces**: Proves that PolyReLU networks attain the optimal approximation rate in Sobolev spaces, that is, they need the fewest parameters for a given error tolerance.
- **Experimental verification**:
  - **Dense model**: On the 1-billion-parameter dense model, models using PolyCom outperform those using other activation functions in terms of training loss, validation perplexity, and downstream task performance.
  - **Mixture-of-Experts (MoE)**
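
The PolyReLU and PolyNorm definitions given under the key technical details can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation (see the linked repository for that). The function names `poly_relu` and `poly_norm`, the use of plain lists for a single feature vector, the convention that the zeroth power equals 1, and the small `eps` guard against division by zero are all assumptions made for this example.

```python
import math


def poly_relu(x, coeffs):
    """PolyReLU(x) = sum_i a_i * max(x, 0)**i, applied element-wise.

    `coeffs` holds a_0..a_r. Assumption: the zeroth power is 1,
    so a_0 acts as a constant offset.
    """
    out = []
    for v in x:
        r = max(v, 0.0)
        out.append(sum(a * r ** i for i, a in enumerate(coeffs)))
    return out


def poly_norm(x, coeffs, eps=1e-6):
    """PolyNorm(x) = sum_i a_i * x**i / ||x**i||_2 for one feature vector.

    `x**i` is the element-wise i-th power. `eps` is an assumed numerical
    guard that is not part of the formula in the summary.
    """
    out = [0.0] * len(x)
    for i, a in enumerate(coeffs):
        powered = [v ** i for v in x]
        norm = math.sqrt(sum(p * p for p in powered))
        for k, p in enumerate(powered):
            out[k] += a * p / (norm + eps)
    return out


# With a_1 = 1 and every other coefficient 0, PolyReLU is plain ReLU.
x = [-1.5, 0.0, 2.0]
print(poly_relu(x, [0.0, 1.0, 0.0, 0.0]))

# An order-3 PolyNorm with hypothetical coefficients.
print(poly_norm(x, [0.0, 1.0, 0.5, 0.25]))
```

The first call illustrates why a PolyReLU network can represent a ReLU network without extra size: choosing the coefficients $ a_1 = 1 $ and $ a_i = 0 $ otherwise reduces the polynomial to ReLU itself. In a real model the coefficients would be learned parameters and the norm would be taken along the feature dimension of a batched tensor.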

Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Interpreting token compositionality in LLMs: A robustness analysis

The Expressibility of Polynomial based Attention Scheme

From Attention to Activation: Unravelling the Enigmas of Large Language Models

Unified Normalization for Accelerating and Stabilizing Transformers

Improving Compositional Generalization Using Iterated Learning and Simplicial Embeddings

PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels

Faith and Fate: Limits of Transformers on Compositionality

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation

Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability

MoDeGPT: Modular Decomposition for Large Language Model Compression

The Impact of Depth on Compositional Generalization in Transformer Language Models

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Activator: GLU Activation Function as the Core Component of a Vision Transformer

Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules

Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Massive Activations in Large Language Models