Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
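
The three quantities named in the abstract can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' measurement code: the function names `causal_softmax` and `extreme_token_stats`, the choice of position 0 as the sink token, and the hand-built toy tensors are all hypothetical.

```python
import numpy as np

def causal_softmax(logits):
    """Row-wise softmax over a (T, T) logit matrix with a causal mask."""
    T = logits.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def extreme_token_stats(attn, values, residuals, sink_index=0):
    """Summarize how extreme one token is for a single head (assumed metrics).

    attn:      (T, T) causal attention weights, rows sum to 1
    values:    (T, d_v) value states
    residuals: (T, d) residual states
    """
    positions = np.arange(attn.shape[0])
    others = positions != sink_index
    later = positions > sink_index  # queries that can see the sink and other keys
    v_norms = np.linalg.norm(values, axis=-1)
    h_norms = np.linalg.norm(residuals, axis=-1)
    return {
        # attention sink: average weight that later queries put on the sink
        "sink_attention": float(attn[later, sink_index].mean()),
        # value-state drain: sink value norm relative to the other tokens
        "value_norm_ratio": float(v_norms[sink_index] / v_norms[others].mean()),
        # residual-state peak: sink residual norm relative to the other tokens
        "residual_norm_ratio": float(h_norms[sink_index] / h_norms[others].mean()),
    }

# Toy data built by hand to exhibit all three phenomena at position 0.
T, d = 5, 4
logits = np.zeros((T, T))
logits[:, 0] = 5.0                      # every query strongly prefers token 0
values = np.ones((T, d))
values[0] *= 0.1                        # small value state at the sink
residuals = np.ones((T, d))
residuals[0] *= 10.0                    # large residual norm at the sink

stats = extreme_token_stats(causal_softmax(logits), values, residuals)
print(stats)
```

On real models the same ratios would be computed from hooked activations per layer and head; the thresholds for calling a token a "sink" are left open here because the source does not state them.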

What problem does this paper attempt to address?

This paper aims to explain three puzzling phenomena observed in Transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena create challenges for LLM inference, quantization, and interpretability. Specifically, the paper focuses on the following issues:

1. **Attention Sinks**:
   - In many attention heads, the initial tokens (such as the start token `<s>` or the separator token) attract disproportionately high attention weights. These tokens are called "sink tokens".
2. **Value-State Drains**:
   - In attention heads that show attention sinks, the value states of the sink tokens are usually much smaller than those of other tokens.
3. **Residual-State Peaks**:
   - The residual-state norm of the sink tokens is significantly larger than that of other tokens, especially in layers other than the first and last.

These phenomena occur together and are consistently present across pretrained LLMs. They complicate downstream uses such as long-context inference and model quantization, and they make attention maps harder to interpret. To address these challenges, the paper sets out to reveal the mechanisms behind the phenomena and to propose mitigation strategies. By studying simplified Transformer architectures (one to three layers) trained on a simple task (the Bigram-Backcopy task), the authors identify an "active-dormant mechanism" and a "mutual reinforcement mechanism", and propose several changes that reduce the impact of these phenomena.

### Main Contributions

1. **Revealing Mechanisms**:
   - Using a simplified model and task, the paper shows how extreme-token phenomena arise from the active-dormant mechanism and the mutual reinforcement mechanism.
2. **Theoretical and Experimental Evidence**:
   - It provides theoretical analysis and experimental evidence that these mechanisms exist not only in simplified models but also in pretrained LLMs.
3. **Proposing Mitigation Strategies**:
   - It recommends replacing the softmax activation in the attention heads with ReLU, and replacing the Adam optimizer with SGD, to mitigate extreme-token phenomena.

In summary, the paper seeks a mechanistic understanding of extreme-token phenomena that can serve as a theoretical basis and a source of practical suggestions for improving the design and performance of large language models.
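
The softmax-to-ReLU replacement mentioned in the mitigation strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and dividing the ReLU scores by the number of visible keys is an assumption of this sketch that the source does not specify. What the sketch does show with certainty is one structural difference: softmax weights in each row must sum to 1, so a head cannot attend to "nothing", whereas ReLU weights can all be zero.

```python
import numpy as np

def softmax_attention_weights(logits):
    """Causal softmax weights for a square (T, T) logit matrix; rows sum to 1."""
    mask = np.tril(np.ones(logits.shape, dtype=bool))
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def relu_attention_weights(logits):
    """Causal ReLU weights; rows need not sum to 1 and may be all zero.

    Scaling by the number of visible keys is an assumed normalization.
    """
    T = logits.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    visible = np.arange(1, T + 1)[:, None]  # keys visible to each query position
    return np.where(mask, np.maximum(logits, 0.0), 0.0) / visible

# A head whose logits are all negative, i.e. one that "wants" to stay inactive.
logits = np.full((4, 4), -2.0)
soft = softmax_attention_weights(logits)
relu = relu_attention_weights(logits)

print(soft.sum(axis=-1))  # each row still distributes a total weight of 1
print(relu.sum(axis=-1))  # each row distributes no weight at all
```

Under softmax, an inactive head still has to place its unit of attention somewhere, which is consistent with the summary's description of heads acting as sinks on some inputs; whether this fully accounts for the reported mitigation is a question for the paper itself.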

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

When Attention Sink Emerges in Language Models: An Empirical View

From Attention to Activation: Unravelling the Enigmas of Large Language Models

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Extending Token Computation for LLM Reasoning

Massive Activations in Large Language Models

TLM: Token-Level Masking for Transformers

Toward a Theory of Tokenization in LLMs

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

Opening the Black Box: Analyzing Attention Weights and Hidden States in Pre-trained Language Models for Non-language Tasks

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Look Within, Why LLMs Hallucinate: A Causal Perspective

Unveiling and Controlling Anomalous Attention Distribution in Transformers