Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu
2024-08-15
Abstract: We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper examines a phenomenon in the internal representations of large language models (LLMs): a very small number of activations take values far larger than the rest (e.g., up to 100,000 times larger). These values stay largely constant regardless of the input and act as indispensable bias terms. They also cause attention probabilities to concentrate on the corresponding tokens and introduce implicit bias terms in the self-attention output. Studying LLMs of different scales and architectures, the paper reports the following:

1. **Existence and Location**: Massive activations are present in the middle layers of the model and appear in a small number of fixed feature dimensions, usually at the start token or a separator token.
2. **Functional Role**: Their values are nearly input-independent, yet the model depends on them; they act as fixed bias terms in its internal computations.
3. **Attention Concentration**: They cause attention to concentrate on the tokens where they occur and introduce implicit bias terms in the self-attention output.
4. **Elimination Method**: Adding explicit bias terms to the self-attention mechanism can eliminate massive activations.

The paper also extends the analysis to Vision Transformers (ViTs), finding that massive activations are less common there but still exist and play a similar functional role.
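
The first finding, that a handful of activations dwarf all others in a hidden state, can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the function name `find_massive_activations`, the two thresholds (an absolute magnitude cutoff and a ratio to the median magnitude), and the toy hidden state are hypothetical and are not taken from this summary or from the authors' repository.

```python
from statistics import median


def find_massive_activations(hidden, abs_threshold=100.0, ratio_threshold=1000.0):
    """Return (token_index, dim_index, value) for outlier activations.

    hidden: one layer's hidden state as a list of per-token vectors
            (list of lists of floats).
    Both thresholds are illustrative assumptions, not values stated in
    the summary above.
    """
    # Median magnitude over every entry of the hidden state.
    med = median(abs(v) for row in hidden for v in row)
    hits = []
    for t, row in enumerate(hidden):
        for d, v in enumerate(row):
            # Flag entries that are large in absolute terms AND
            # far larger than the typical magnitude.
            if abs(v) > abs_threshold and abs(v) >= ratio_threshold * med:
                hits.append((t, d, v))
    return hits


# Toy hidden state: 4 tokens x 6 feature dimensions with small values,
# plus one planted outlier at the first token in feature dimension 3.
hidden = [[0.1 * ((t + d) % 5 + 1) for d in range(6)] for t in range(4)]
hidden[0][3] = 2500.0

hits = find_massive_activations(hidden)
for token_index, dim_index, value in hits:
    print(f"token {token_index}, dim {dim_index}: {value}")
```

On a real model, `hidden` would come from a layer's output for a given prompt; running the same check across layers and prompts is one way to see whether the flagged positions and dimensions stay fixed, as the summary describes.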