Abstract: Transformer-based Large Language Models (LLMs) have achieved remarkable success across natural language processing tasks, which is primarily attributed to the self-attention mechanism: each token takes all preceding tokens as its context when computing attention scores. However, when the context length L becomes very large (e.g., 32K), the context of any given token contains more redundant information, and self-attention suffers from two main limitations: 1) its computational and memory complexity scales quadratically with L; 2) redundant context information may hamper the model's ability to capture dependencies among crucial tokens, which may degrade representation quality. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling. It consists of two components: 1) globality-pooling attention, which divides input tokens into groups and then dynamically merges the tokens within each group into one core token based on their significance; 2) locality-preserved attention, which incorporates neighboring tokens into the attention calculation. The two complementary attentions are then fused into the final attention, retaining the comprehensive modeling ability of full self-attention. In this way, the core context of a given token is automatically emphasized and strengthened, while the context in redundant groups is diminished during learning. As a result, the computational and memory complexity is significantly reduced. More importantly, CCA-Attention can improve long-context modeling ability by diminishing redundant context information. Extensive experimental results demonstrate that CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
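
The two components described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `cca_attention_sketch`, the parameters `group_size` and `window`, the choice of the last query in each group as the significance scorer, the pooling of keys and values directly, the alignment of the local window to a group boundary, and the fusion by a single softmax over core and local keys are all assumptions made here for illustration. The abstract specifies only that tokens are grouped, merged by significance into core tokens, combined with neighboring tokens, and fused.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cca_attention_sketch(q, k, v, group_size, window):
    """Single-head causal attention over core tokens plus a local window.

    q, k, v: arrays of shape (L, d). Returns an array of shape (L, d).
    All design details below are illustrative assumptions.
    """
    L, d = q.shape
    scale = 1.0 / np.sqrt(d)
    n_groups = L // group_size  # only complete groups become core tokens

    # 1) Globality pooling: merge each group into one core key/value pair.
    core_k = np.zeros((n_groups, d))
    core_v = np.zeros((n_groups, d))
    for j in range(n_groups):
        s, e = j * group_size, (j + 1) * group_size
        # Assumption: significance weights come from the group's last query
        # scored against the keys of the same group.
        w = softmax(q[e - 1] @ k[s:e].T * scale)
        core_k[j] = w @ k[s:e]
        core_v[j] = w @ v[s:e]

    out = np.zeros((L, d))
    for i in range(L):
        # 2) Locality: the last `window` tokens, with the start rounded down
        # to a group boundary so no earlier token is left uncovered.
        lo = (max(0, i - window + 1) // group_size) * group_size
        n_vis = lo // group_size  # groups that end before the local window

        # 3) Fusion (assumption): one softmax over core and local keys.
        keys = np.concatenate([core_k[:n_vis], k[lo:i + 1]], axis=0)
        vals = np.concatenate([core_v[:n_vis], v[lo:i + 1]], axis=0)
        out[i] = softmax(q[i] @ keys.T * scale) @ vals
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
out = cca_attention_sketch(q, k, v, group_size=4, window=4)
print(out.shape)  # (16, 8)
```

In this toy setting the last token attends to 3 core tokens and 4 local tokens, 7 keys in total, instead of all 16. More generally, under these assumptions each query sees at most L / `group_size` core tokens plus `window` + `group_size` - 1 local tokens. The explicit Python loops are for readability only; they do not reflect the efficiency of a batched implementation.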

Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention

Focused Transformer: Contrastive Training for Context Scaling

Adaptive Multi-Resolution Attention with Linear Complexity

Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost

Hybrid Focal and Full-Range Attention Based Graph Transformers

FLatten Transformer: Vision Transformer using Focused Linear Attention

Lightweight Vision Transformer with Bidirectional Interaction

Improving Transformers with Dynamically Composable Multi-Head Attention

CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending

Fastformer: Additive Attention Can Be All You Need

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Evolving Masked Low-Rank Transformer for Long Text Understanding

Modeling Graph Structure in Transformer for Better AMR-to-Text Generation

Core Context Aware Attention for Long Context Language Modeling

AttentionViz: A Global View of Transformer Attention

TransformerFAM: Feedback attention is working memory

FAM: Improving columnar vision transformer with feature attention mechanism

Long-range Sequence Modeling with Predictable Sparse Attention

Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers