RecurFormer: Not All Transformer Heads Need Self-Attention

Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang
2024-10-10
Abstract: Transformer-based large language models (LLMs) excel at modeling complex language patterns but incur significant computational costs during inference, especially with long inputs, because of the attention mechanism's memory overhead. We observe that certain attention heads concentrate their attention weights on tokens near the query token, a pattern we term recency-aware, which captures local, short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows the weights of pre-trained Transformer-based LLMs to be reused with continual training. Experiments demonstrate that RecurFormer matches the original model's performance while significantly improving inference efficiency. Our approach provides a practical solution to the computational challenges of Transformer-based LLM inference, making it highly attractive for tasks involving long inputs.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the high computational cost of inference in Transformer-based large language models (LLMs), especially on long inputs. The cost stems mainly from the memory overhead of the attention mechanism. The authors observe that some attention heads concentrate their attention weights on tokens close to the query token. They call this pattern "recency-aware"; such heads mainly capture local, short-range dependencies.

### Solutions

The authors propose a new architecture, RecurFormer. It replaces the attention heads that exhibit the recency-aware pattern with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This reduces the cache size without evicting tokens, so generation quality is maintained. RecurFormer keeps the ability to model long-range dependencies through the remaining attention heads, and it allows the weights of pre-trained Transformer-based LLMs to be reused, with continual training used to restore performance.

### Main contributions

1. **First discovery**: Inspired by the dependency length minimization (DLM) phenomenon in quantitative linguistics and by the computational principle of the attention mechanism, the authors are the first to observe that some attention heads in Transformer-based LLMs can be effectively replaced by linear RNN structures because they have the recency-aware property.
2. **Proposing a new architecture**: The authors propose RecurFormer, a recurrence-optimized Transformer architecture that replaces the recency-aware attention heads with the Mamba architecture, reducing the cache size while reusing the original model weights.
3. **Experimental verification**: In the HashHop experiment, the authors show that RecurFormer can match the quality of the original model. Continual training is shown to recover performance, and an ablation study on the multi-query associative recall (MQAR) task shows that some attention heads need to be retained.

### Conclusion

RecurFormer offers a practical way to reduce the inference cost of Transformer-based LLMs on long inputs while keeping generation quality comparable to that of the original model.
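
### Illustration: flagging recency-aware heads

The recency-aware property described above can be illustrated with a small sketch. This is not the authors' selection procedure: the `recency_ratio` metric, the `window` size, and the `threshold` value are all assumptions chosen for illustration. The idea is to measure, for each head, how much attention mass falls on the most recent tokens, and to flag heads where almost all of it does as candidates for replacement by a linear RNN.

```python
import math

def causal_softmax_attention(scores):
    """Row-wise softmax where query i only sees keys 0..i (causal mask)."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def recency_ratio(weights, window):
    """Hypothetical metric: mean attention mass on the `window` most recent
    keys (including the query position), averaged over all queries."""
    ratios = []
    for i, row in enumerate(weights):
        start = max(0, i - window + 1)
        ratios.append(sum(row[start : i + 1]))
    return sum(ratios) / len(ratios)

def select_recency_aware_heads(head_weights, window, threshold):
    """Return indices of heads whose recency ratio reaches `threshold`."""
    return [
        h for h, w in enumerate(head_weights)
        if recency_ratio(w, window) >= threshold
    ]

# Two synthetic heads over a 16-token sequence (illustrative data only).
n = 16
# Head 0: scores decay with distance, so attention stays local.
local_scores = [[-2.0 * abs(i - j) for j in range(n)] for i in range(n)]
# Head 1: equal scores, so attention is spread over the whole prefix.
uniform_scores = [[0.0 for _ in range(n)] for _ in range(n)]

heads = [
    causal_softmax_attention(local_scores),
    causal_softmax_attention(uniform_scores),
]
print(select_recency_aware_heads(heads, window=4, threshold=0.9))  # prints [0]
```

Only the local head is flagged here. The uniform head still needs the full key-value history, which matches the summary's point that some attention heads must be retained for long-range dependencies.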