Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs

Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
2024-10-18
Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
Machine Learning
What problem does this paper attempt to address?
This paper sets out to explain three puzzling phenomena observed in Transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena create challenges for LLM inference, quantization, and interpretability. Specifically, the paper focuses on the following issues:

1. **Attention Sinks**:
   - In many attention heads, the initial tokens (such as the start token `<s>` or the separator token) attract disproportionately high attention weights. These tokens are called "sink tokens".
2. **Value-State Drains**:
   - In attention heads that show attention sinks, the value states of the sink tokens are usually much smaller than those of other tokens.
3. **Residual-State Peaks**:
   - The residual-state norm of the sink tokens is significantly larger than that of other tokens, especially in layers other than the first and the last.

These phenomena occur together and appear consistently across pretrained LLMs. They complicate downstream work such as long-context inference and model quantization, and they make attention maps harder to interpret.

To address these challenges, the paper aims to reveal the mechanisms behind these phenomena and to propose mitigation strategies. By studying simplified Transformer architectures (one to three layers) trained on a simple task (the Bigram-Backcopy task), the authors identify an "active-dormant mechanism" and a "mutual reinforcement mechanism", and propose several changes that reduce the impact of these phenomena.

### Main Contributions

1. **Revealing Mechanisms**:
   - Using a simplified model and task, the paper shows how extreme-token phenomena arise from the active-dormant mechanism and the mutual reinforcement mechanism.
2. **Theoretical and Experimental Evidence**:
   - It provides theoretical analysis and experimental evidence that these mechanisms exist not only in simplified models but also in pretrained LLMs.
3. **Proposing Mitigation Strategies**:
   - It proposes replacing the softmax activation with ReLU in the attention heads, and replacing the Adam optimizer with SGD, to mitigate extreme-token phenomena during pretraining.

In summary, the paper provides a theoretical basis and practical suggestions for improving the design and performance of large language models, grounded in a mechanistic understanding of extreme-token phenomena.
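
The three phenomena and the softmax-versus-ReLU mitigation can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the helper names (`extreme_token_stats`, `relu_weights`), the hand-picked logits, and the value and residual vectors are invented toy data, not measurements or code from the paper. The ReLU variant is left unnormalized for simplicity and may differ from the exact formulation the authors use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_weights(scores):
    """Unnormalized ReLU attention weights (illustrative simplification)."""
    return [max(0.0, s) for s in scores]


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def extreme_token_stats(attn, values, residuals):
    """Per-token diagnostics for a single causal attention head.

    attn[i][j]   : weight that query i places on key j (0 when j > i).
    values[j]    : value state of token j.
    residuals[j] : residual state of token j.
    """
    n = len(attn)
    received = []
    for j in range(n):
        # Average over the queries i >= j that are allowed to see key j.
        col = [attn[i][j] for i in range(j, n)]
        received.append(sum(col) / len(col))
    return {
        "attention_received": received,
        "value_norm": [l2_norm(v) for v in values],
        "residual_norm": [l2_norm(r) for r in residuals],
    }


# Hypothetical causal logits: every query scores token 0 (the sink) highest.
logits = [
    [4.0],
    [4.0, 0.0],
    [4.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
]
n = len(logits)
# Pad each causal row with zeros so attn is a full n-by-n matrix.
attn = [softmax(row) + [0.0] * (n - len(row)) for row in logits]

# Toy states: token 0 has a tiny value state and a huge residual state.
values = [[0.01, 0.0], [1.0, 0.5], [0.8, -0.6], [-0.7, 0.9]]
residuals = [[30.0, 40.0], [1.0, 1.0], [0.5, -1.5], [1.2, 0.3]]

stats = extreme_token_stats(attn, values, residuals)
for name, per_token in stats.items():
    print(name, [round(x, 3) for x in per_token])

# A head with nothing useful to attend to: all logits are negative.
dormant_logits = [-2.0, -3.0, -1.0]
softmax_w = softmax(dormant_logits)     # weights still sum to 1
relu_w = relu_weights(dormant_logits)   # every weight is exactly 0
print("softmax total:", round(sum(softmax_w), 6), "relu total:", sum(relu_w))
```

In this toy head, token 0 receives most of the attention while having the smallest value norm and the largest residual norm, which is the pattern the summary describes. The last lines hint at why a softmax head might need a sink at all: softmax weights always sum to one, so a head that should contribute nothing still has to place its attention somewhere, whereas ReLU weights can all be zero.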