Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu

2024-08-15

Abstract: We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
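As a rough illustration of what "significantly larger than others" can mean in practice, the sketch below flags entries of a hidden-state matrix whose magnitude exceeds both an absolute floor and a multiple of the median magnitude. The function name `find_massive_activations` and the thresholds `ratio` and `min_magnitude` are hypothetical choices made for this example, not values stated in the abstract, and the hidden states are synthetic rather than taken from a real model.

```python
import numpy as np

def find_massive_activations(hidden, ratio=1000.0, min_magnitude=100.0):
    """Return (token_index, feature_index) pairs with extreme magnitude.

    hidden: array of shape (num_tokens, hidden_dim) from a single layer.
    ratio, min_magnitude: illustrative thresholds (assumptions for this sketch).
    """
    magnitudes = np.abs(hidden)
    median = np.median(magnitudes)  # typical activation scale in this layer
    mask = (magnitudes > min_magnitude) & (magnitudes > ratio * median)
    # argwhere scans row by row, so results are ordered by token, then feature
    return [(int(t), int(f)) for t, f in np.argwhere(mask)]

# Synthetic hidden states: unit-scale noise plus two planted outliers
# on the first token, in two fixed feature dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 64))
hidden[0, 5] = 2500.0
hidden[0, 17] = -1200.0

found = find_massive_activations(hidden)
print(found)
```

Comparing against the median rather than the mean keeps the reference scale from being inflated by the outliers themselves.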

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper studies a phenomenon in the internal representations of large language models (LLMs): a very small number of activations are far larger than the rest (e.g., 100,000 times larger). Their values stay largely constant regardless of the input, and they act as indispensable bias terms. They also cause attention probabilities to concentrate on specific tokens and introduce implicit bias terms in the self-attention output. Investigating LLMs of different scales and architectures, the paper reports the following findings:

1. **Existence and Location**: Massive activations are present in the middle layers of the model and appear in fixed feature dimensions, usually associated with the start token or a separator token.
2. **Functional Role**: Massive activations play a fixed but important role as bias terms in the model's internal computations.
3. **Attention Concentration**: Massive activations cause attention to concentrate on their corresponding tokens and introduce implicit bias terms in the self-attention mechanism.
4. **Elimination Method**: Adding explicit bias terms to the self-attention mechanism can eliminate massive activations.

The paper also extends the analysis to Vision Transformers (ViTs), finding that massive activations are less common there but still exist and exhibit similar functional characteristics.
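The summary does not spell out how an explicit bias term enters self-attention. One plausible reading, assumed here, is to append a single extra key/value pair that is shared by every query and independent of the input tokens. The sketch below implements that reading for one causal attention head in NumPy. The names `attention_with_explicit_bias`, `k_bias` and `v_bias` are hypothetical, and the bias vectors are random here rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_explicit_bias(Q, K, V, k_bias, v_bias):
    """Single-head causal attention with one extra key/value slot.

    Q, K, V: arrays of shape (T, d). k_bias, v_bias: arrays of shape (d,).
    The extra slot is visible to every query and does not depend on the input.
    """
    T, d = Q.shape
    K_aug = np.vstack([K, k_bias[None, :]])   # (T + 1, d)
    V_aug = np.vstack([V, v_bias[None, :]])   # (T + 1, d)
    scores = Q @ K_aug.T / np.sqrt(d)         # (T, T + 1)
    causal = np.tril(np.ones((T, T), dtype=bool))
    visible = np.hstack([causal, np.ones((T, 1), dtype=bool)])
    scores = np.where(visible, scores, -np.inf)  # hide future tokens
    probs = softmax(scores, axis=-1)
    return probs @ V_aug, probs

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
k_bias, v_bias = rng.normal(size=d), rng.normal(size=d)

out, probs = attention_with_explicit_bias(Q, K, V, k_bias, v_bias)

# The output splits into an input-dependent part and a bias part.
token_part = probs[:, :T] @ V
bias_part = probs[:, -1:] * v_bias
print(np.allclose(out, token_part + bias_part))
```

In this formulation the bias part always points along the fixed direction `v_bias`, which gives the model a dedicated place to store an input-independent contribution instead of routing it through a few tokens.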

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

From Attention to Activation: Unravelling the Enigmas of Large Language Models

How do Large Language Models Handle Multilingualism?

Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models

Achieving Sparse Activation in Small Language Models

Why Larger Language Models Do In-context Learning Differently?

House of Cards: Massive Weights in LLMs

A Law of Next-Token Prediction in Large Language Models

Unlocking the potential: A comprehensive exploration of large language models in natural language processing

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Exploring Activation Patterns of Parameters in Language Models

Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Large Language Models: A Survey

A Survey of Large Language Models

Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

The Information of Large Language Model Geometry