Abstract: Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism's memory overhead. We observe that certain attention heads exhibit a distribution where the attention weights concentrate on tokens near the query token, termed as recency aware, which focuses on local and short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows for reusing pre-trained Transformer-based LLM weights with continual training. Experiments demonstrate that RecurFormer matches the original model's performance while significantly enhancing inference efficiency. Our approach provides a practical solution to the computational challenges of Transformer-based LLM inference, making it highly attractive for tasks involving long inputs.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper targets the significant computational cost that Transformer-based large language models (LLMs) incur during inference, especially with long inputs, where the memory overhead of the attention mechanism is the cause. The authors observe that in some attention heads the attention weights concentrate on tokens close to the query token. They call this property "recency aware"; such heads mainly capture local, short-range dependencies.

### Solutions

To address this, the authors propose a new architecture, RecurFormer. It replaces the attention heads that exhibit the "recency aware" property with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This reduces the cache size without evicting tokens, which maintains generation quality. RecurFormer also retains the ability to model long-range dependencies through the remaining attention heads, and it allows the weights of pre-trained Transformer-based LLMs to be reused, with continual training to restore performance.

### Main contributions

1. **First discovery**: Inspired by the dependency length minimization (DLM) phenomenon in quantitative linguistics and by the computational principle of the attention mechanism, the authors observe that some attention heads in Transformer-based LLMs can be effectively replaced by linear RNN structures because they have the "recency aware" property.
2. **Proposing a new architecture**: The authors propose RecurFormer, a new recursively optimized Transformer architecture, which replaces the attention heads affected by the "recency aware" property with the Mamba architecture, thereby reducing the cache size while reusing the original model weights.
3. **Experimental verification**: Through the HashHop experiment, the authors show that RecurFormer can match the quality of the original model. Continual training confirms the performance recovery, and an ablation study on the multi-query associative recall (MQAR) task shows that some attention mechanisms need to be retained.

### Conclusion

RecurFormer provides a practical solution to the computational challenges that Transformer-based LLMs face when processing long inputs during inference, making it more attractive for tasks involving long inputs while keeping generation quality comparable to the original model.
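To make the "recency aware" observation concrete, the following is a minimal sketch of one way such heads could be flagged from a causal attention map: measure how much of each head's attention mass falls on the few most recent tokens. The function names, the `window` size, and the `threshold` are illustrative assumptions, not the selection criterion used by the RecurFormer authors.

```python
import numpy as np

def recency_ratio(attn, window):
    """Mean fraction of attention mass each head places on the `window`
    most recent tokens (including the query position itself).

    attn: array of shape (num_heads, seq_len, seq_len) holding causal
          attention weights whose rows sum to 1.
    """
    _, seq_len, _ = attn.shape
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    # True where the key is at or before the query and within the window.
    local = (k <= q) & (q - k < window)
    local_mass = (attn * local).sum(axis=-1)   # (num_heads, seq_len)
    return local_mass.mean(axis=-1)            # (num_heads,)

def select_recency_heads(attn, window=4, threshold=0.9):
    """Indices of heads whose recency ratio meets the (assumed) threshold."""
    ratios = recency_ratio(attn, window)
    return [h for h, r in enumerate(ratios) if r >= threshold]

# Toy example with two synthetic heads over 16 tokens.
seq_len = 16
local_head = np.eye(seq_len)                      # attends only to itself
causal_mask = np.tril(np.ones((seq_len, seq_len)))
uniform_head = causal_mask / causal_mask.sum(axis=-1, keepdims=True)
demo_attn = np.stack([local_head, uniform_head])  # (2, 16, 16)

demo_ratios = recency_ratio(demo_attn, window=4)
demo_selected = select_recency_heads(demo_attn, window=4, threshold=0.9)
```

In this toy setup the first head places all of its mass inside the window and is selected, while the uniformly attending head is not. In a hybrid model along the lines described above, heads selected this way would be candidates for replacement by a recurrent block, with the remaining heads kept as attention for long-range dependencies.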

RecurFormer: Not All Transformer Heads Need Self-Attention

Recycled Attention: Efficient inference for long-context language models

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

X-former Elucidator: Reviving Efficient Attention for Long Context Language Modeling

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Retentive Network: A Successor to Transformer for Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Just read twice: closing the recall gap for recurrent language models

LoMA: Lossless Compressed Memory Attention

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Value Residual Learning For Alleviating Attention Concentration In Transformers

LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models

LongHeads: Multi-Head Attention is Secretly a Long Context Processor

TRAMS: Training-free Memory Selection for Long-range Language Modeling

Improving Transformers with Dynamically Composable Multi-Head Attention

RWKV: Reinventing RNNs for the Transformer Era

LazyFormer: Self Attention with Lazy Update

LaMemo: Language Modeling with Look-Ahead Memory

What Matters in Transformers? Not All Attention is Needed