Abstract: Attention is a fundamental component behind the remarkable achievements of large language models (LLMs). However, our current understanding of the attention mechanism, especially regarding how attention distributions are established, remains limited. Inspired by recent studies that explore the presence of an attention sink at the initial token, which receives disproportionately large attention scores despite its lack of semantic importance, this work delves deeper into this phenomenon. We aim to provide a more profound understanding of the existence of attention sinks within LLMs and to uncover ways to enhance the achievable accuracy of LLMs by directly optimizing the attention distributions, without the need for weight finetuning. Specifically, this work begins with comprehensive visualizations of the attention distributions in LLMs during inference across various inputs and tasks. Based on these visualizations, to the best of our knowledge, we are the first to discover that (1) attention sinks occur not only at the start of sequences but also within later tokens of the input, and (2) not all attention sinks have a positive impact on the achievable accuracy of LLMs. Building upon our findings, we propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner. Extensive experiments validate that ACT consistently enhances the accuracy of various LLMs across different applications. Specifically, ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B. Our code is available at https://github.com/GATECH-EIC/ACT.
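
To make the notion of an attention sink concrete, the following is a minimal illustrative sketch and not the authors' ACT implementation (see the linked repository for that). It assumes a toy row-normalized attention matrix and flags a key token as a sink when the average attention it receives exceeds `alpha / n`. It then scales the attention on selected sinks by `beta` and redistributes the freed mass proportionally to the remaining tokens. The function names, the threshold rule, the values of `alpha` and `beta`, and the choice to leave the initial token untouched are all assumptions made for illustration.

```python
def find_attention_sinks(attn, alpha=1.5):
    """Return indices of key tokens whose mean received attention exceeds alpha / n.

    attn is a list of rows (one per query token); each row sums to 1.
    The alpha / n threshold is an illustrative assumption.
    """
    n = len(attn[0])
    received = [sum(row[j] for row in attn) / len(attn) for j in range(n)]
    return [j for j, r in enumerate(received) if r > alpha / n]


def calibrate(attn, sinks, beta=0.5, keep=(0,)):
    """Scale attention on sinks not listed in keep by beta, then give the
    freed probability mass to all other tokens in proportion to their
    current attention, so every row still sums to 1.

    Which sinks to dampen (here: all except token 0) is a hypothetical choice.
    """
    targets = [j for j in sinks if j not in keep]
    calibrated = []
    for row in attn:
        freed = sum(row[j] * (1.0 - beta) for j in targets)
        rest = sum(v for j, v in enumerate(row) if j not in targets)
        if rest == 0:
            # Nothing to redistribute to; leave the row unchanged.
            calibrated.append(list(row))
            continue
        calibrated.append([
            v * beta if j in targets else v + freed * v / rest
            for j, v in enumerate(row)
        ])
    return calibrated


# Toy attention map: tokens 0 and 2 attract most of the attention.
attn = [
    [0.45, 0.05, 0.45, 0.05],
    [0.40, 0.10, 0.40, 0.10],
    [0.50, 0.05, 0.40, 0.05],
    [0.45, 0.10, 0.40, 0.05],
]
sinks = find_attention_sinks(attn)
calibrated = calibrate(attn, sinks)
```

In this toy example the sink at position 2 illustrates finding (1), a sink that appears later in the sequence rather than at the start. The calibration step only rescales attention probabilities and touches no model weights, which is the sense in which such a procedure is training-free.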

Exact Hard Monotonic Attention for Character-Level Transduction

On Biasing Transformer Attention Towards Monotonicity

Morphological Inflection Generation with Hard Monotonic Attention

Learning Monotonic Attention in Transducer for Streaming Generation

Better Simultaneous Translation with Monotonic Knowledge Distillation

Infusing Future Information into Monotonic Attention Through Language Models

Learning Transductions and Alignments with RNN Seq2seq Models

Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation

Monotonic Location Attention for Length Generalization

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Attention Is All You Need

Latent Alignment and Variational Attention

Optimizing Attention for Sequence Modeling Via Reinforcement Learning

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Simulating Hard Attention Using Soft Attention

Enhancing Monotonicity for Robust Autoregressive Transformer TTS

Word Attention for Sequence to Sequence Text Understanding

Efficient Monotonic Multihead Attention

Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Treeformer: Dense Gradient Trees for Efficient Attention Computation