Abstract: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
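
The contrast between softmax attention and sigmoid attention without normalization can be illustrated with a toy example. The sketch below is not the authors' code; the function names and score values are made up for illustration. It shows that softmax weights always sum to 1, so a query that matches no key must still place its attention somewhere, while unnormalized sigmoid weights can all stay near zero.

```python
import math

def softmax_weights(scores):
    """Softmax over one query's attention logits (weights sum to 1)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_weights(scores):
    """Element-wise sigmoid with no normalization across keys."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical logits: the query matches none of the four keys.
no_match = [-6.0, -6.0, -6.0, -6.0]
# Hypothetical logits: the first key carries a large, bias-like score.
with_sink = [2.0, -6.0, -6.0, -6.0]

soft_no_match = softmax_weights(no_match)
sig_no_match = sigmoid_weights(no_match)
soft_with_sink = softmax_weights(with_sink)

print("softmax, no match:", [round(w, 3) for w in soft_no_match])
print("sigmoid, no match:", [round(w, 3) for w in sig_no_match])
print("softmax, sink key:", [round(w, 3) for w in soft_with_sink])
```

Under softmax, the first key in `with_sink` absorbs nearly all of the attention mass, which is one way to picture the "key bias" reading of attention sink. Under sigmoid, the weights in `no_match` are free to stay small because no key competes with another.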

What problem does this paper attempt to address?

This paper seeks to understand the attention sink phenomenon in language models (LMs), in which a model allocates significant attention to the first token even when that token is not semantically important. Although the phenomenon has been widely adopted in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, and model quantization, its underlying mechanisms and influencing factors are still poorly understood. Specifically, the paper explores the following questions:

1. **Does attention sink exist universally across different inputs in language models?** The authors aim to verify that the phenomenon is not limited to specific datasets or model architectures, but is universal, even in small models.
2. **When and how does attention sink emerge during pre-training?** The paper analyzes how optimization, data distribution, loss function, and model architecture influence attention sink in order to understand how it forms.
3. **The nature of attention sink and its impact on model performance.** The authors find that attention sink acts more like key biases: it stores extra attention scores that may be non-informative and do not contribute to the value computation. They attribute the phenomenon, at least partially, to tokens' inner dependence on attention scores caused by softmax normalization. When softmax attention is replaced with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.

In summary, the main objective of this paper is to conduct empirical research into the causes of attention sink and its impact on language models, providing a basis for subsequent improvements in model design and training strategies.
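
One simple way to make the first question measurable is to average, over queries, the attention weight placed on the first token. The sketch below assumes a single head's causal attention logits given as nested lists. `first_token_share` is a hypothetical helper, not necessarily the metric used in the paper, and the score matrices are toy values.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where query i may only attend to keys 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def first_token_share(weights):
    """Mean attention on position 0, skipping query 0 (it can only see itself)."""
    later_rows = weights[1:]
    return sum(row[0] for row in later_rows) / len(later_rows)

T = 4  # toy sequence length
# All logits equal: attention spreads evenly over visible tokens.
uniform_scores = [[0.0] * T for _ in range(T)]
# First key has a much larger logit for every query: a sink-like pattern.
sink_scores = [[5.0] + [0.0] * (T - 1) for _ in range(T)]

uniform_share = first_token_share(causal_softmax(uniform_scores))
sink_share = first_token_share(causal_softmax(sink_scores))
print(round(uniform_share, 3), round(sink_share, 3))
```

A share close to 1 indicates that most queries send nearly all of their attention to the first token. In practice, such a quantity would be computed per head and per layer on real attention maps instead of on toy logits.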

When Attention Sink Emerges in Language Models: An Empirical View

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Efficient Streaming Language Models with Attention Sinks

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Spectral Filters, Dark Signals, and Attention Sinks

Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models

SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models

Self-attention Mechanism at the Token Level: Gradient Analysis and Algorithm Optimization

Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens

Why Attentions May Not Be Interpretable?

HuaSLIM: Human Attention Motivated Shortcut Learning Identification and Mitigation for Large Language models

Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

Max-Margin Token Selection in Attention Mechanism

Anchor Attention, Small Cache: Code Generation with Large Language Models