Unveiling and Controlling Anomalous Attention Distribution in Transformers

Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao
2024-07-04
Abstract: With the advent of large models based on the Transformer architecture, researchers have observed an anomalous phenomenon in the attention mechanism: very high attention on the first element, which is prevalent across Transformer-based models. Understanding it is crucial for developing techniques that focus on attention distribution, such as Key-Value (KV) Cache compression and infinite extrapolation; however, its underlying cause remains unknown. In this paper, we analyze this phenomenon from the perspective of the waiver phenomenon, which involves reducing the internal values of certain elements in the sequence, allowing them to absorb excess attention without affecting their contribution to information. In specific models, due to differences in positional encoding and attention patterns, we find that the model's selection of waiver elements falls into two categories: positional-encoding-based and feature-distribution-within-elements-based.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses an anomaly in Transformer models: the attention mechanism assigns very high attention to the first element of the sequence. The phenomenon is widespread, but its underlying cause has not yet been identified. Understanding it matters for techniques that rely on the attention distribution, such as key-value (KV) cache compression and infinite extrapolation. The paper therefore analyzes the anomaly from the perspective of the "waiver" phenomenon: the model reduces the internal values of certain elements so that they can absorb excess attention without affecting their contribution to information. It identifies two ways in which models select waiver elements, depending on positional encoding and attention patterns: selection based on positional encoding and selection based on the feature distribution within elements. The paper also designs experiments to test these hypotheses, reports that it can control whether a given element becomes a waiver element, and reports strong experimental results.
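The waiver idea described above can be illustrated with a toy single-query attention step. This is only a minimal sketch and not the paper's method or experimental setup: the scores and value vectors are hypothetical numbers chosen for illustration, and the names `softmax` and `attend` are assumptions introduced here. It shows that an element can receive almost all of the attention mass while contributing almost nothing to the output, provided its value vector is close to zero.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # Weighted sum of value vectors using softmax attention weights.
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Hypothetical toy inputs: element 0 gets a large score but has a
# near-zero value vector, so it plays the role of a waiver element.
scores = [6.0, 1.0, 0.5, 0.2]
values = [[0.001, -0.001], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, out = attend(scores, values)

# Attention mass absorbed by the first element.
first_mass = weights[0]
# What the first element actually adds to the output vector.
first_contrib = [weights[0] * x for x in values[0]]

print("attention on first element:", first_mass)
print("its contribution to the output:", first_contrib)
print("attention output:", out)
```

In this sketch the first element takes most of the softmax mass, yet the output is determined almost entirely by the remaining elements, which is the sense in which a waiver element absorbs excess attention.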