When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
2024-10-15
Abstract: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper seeks to understand the attention sink phenomenon in language models (LMs), including large-scale ones: the model's tendency to allocate significant attention to the first token, even when that token is not semantically important. Although the phenomenon is already exploited in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, and model quantization, its underlying mechanisms and influencing factors are still poorly understood. Specifically, the paper explores the following questions:

1. **Does attention sink exist universally across different inputs in language models?** The authors aim to verify that the phenomenon is not limited to specific datasets or model architectures but appears universally, even in small models.
2. **When and how does attention sink emerge during pre-training?** The authors analyze how optimization, data distribution, loss function, and model architecture influence its emergence in order to understand how it forms.
3. **The nature of the attention sink phenomenon and its impact on model performance**: The authors find that attention sink acts more like a key bias that stores extra attention scores, which may be non-informative and may not contribute to the value computation. They also observe that the phenomenon stems, at least partially, from tokens' inner dependence on attention scores caused by softmax normalization. When softmax attention is replaced with other attention operations, such as sigmoid attention without normalization, attention sink does not emerge in models of up to 1 billion parameters.
In summary, the main objective of this paper is to study empirically why attention sink emerges and how it affects language models, providing a basis for later improvements to model design and training strategies.
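The dependence introduced by softmax normalization can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's implementation: the function names and the logit values are hypothetical, and it handles a single row of attention logits rather than a full attention layer. It shows two things, that softmax weights must sum to 1 even when no key is a good match, and that raising one logit changes every other softmax weight, whereas element-wise sigmoid weights are independent of one another.

```python
import math


def softmax_weights(logits):
    """Softmax over one row of attention logits; the weights always sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_weights(logits):
    """Element-wise sigmoid with no normalization across the row."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical logits for one query: no key is a good match.
logits = [-4.0, -4.0, -4.0, -4.0]
# Same row, but with only the first key's logit raised.
boosted = [2.0] + logits[1:]

soft_base = softmax_weights(logits)
soft_boost = softmax_weights(boosted)
sig_base = sigmoid_weights(logits)
sig_boost = sigmoid_weights(boosted)

# Softmax must place a total mass of 1 somewhere; sigmoid may stay near 0.
print("softmax row sum:", sum(soft_base))
print("sigmoid row sum:", sum(sig_base))

# Raising the first logit shrinks every other softmax weight,
# while the other sigmoid weights are left untouched.
print("softmax weight on key 1:", soft_base[1], "->", soft_boost[1])
print("sigmoid weight on key 1:", sig_base[1], "->", sig_boost[1])
```

In this toy setting, softmax hands almost all of the mass to the first key once its logit is raised, which is consistent with the abstract's description of a sink storing extra attention scores. It is only an illustration of the normalization constraint, not evidence about trained models.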