Abstract: In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models attempt to extend their context length over a million tokens, they remain impractical due to the quadratic time and space complexities. Although recent works on linear and sparse attention mechanisms can achieve this goal, their real-world applicability is often limited by the need to re-train from scratch and significantly worse performance. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We notice a pattern in the attention scores of pretrained LLMs where tokens close together tend to have similar scores, which we call "attention locality". Based on this observation, we utilize a novel tree-search-like algorithm that estimates the top-$k$ key tokens for a given query on the fly, which is mathematically guaranteed to have better performance than random attention pruning. In addition to improving the time complexity of the attention mechanism, we further optimize GPU memory usage by implementing KV cache offloading, which stores only $O(\log T)$ tokens on the GPU while maintaining similar decoding throughput. Experiments on benchmarks show that HiP, with its training-free nature, significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation. HiP enables pretrained LLMs to scale up to millions of tokens on commodity GPUs, potentially unlocking long-context LLM applications previously deemed infeasible.
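
To give a feel for the gap between quadratic and $O(T \log T)$ attention cost, the following back-of-the-envelope sketch counts query-key score evaluations. It is only an illustration: the function names are hypothetical, and the assumption that a pruned search scores about $2k$ candidates per level over $\log_2(T/k)$ levels is ours, not a figure taken from the paper.

```python
import math


def dense_score_count(T: int) -> int:
    """Query-key scores computed by full attention: every query sees every key."""
    return T * T


def pruned_score_count(T: int, k: int) -> int:
    """Rough count for a tree-search-like top-k estimate.

    Assumption (illustrative only): each query scores about 2*k candidates
    per level, and the search needs about log2(T / k) levels.
    """
    levels = max(1, math.ceil(math.log2(T / k)))
    return T * 2 * k * levels


if __name__ == "__main__":
    for T in (2**14, 2**17, 2**20):
        dense = dense_score_count(T)
        pruned = pruned_score_count(T, k=512)
        print(f"T={T}: dense={dense}, pruned={pruned}, ratio={dense / pruned:.1f}")
```

Under these assumptions the ratio grows roughly as $T / (k \log T)$, so the saving becomes larger as the context gets longer.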

What problem does this paper attempt to address?

This paper addresses the quadratic time and space complexity of the attention mechanism, which makes long contexts impractical in large language models (LLMs). Although many current Transformer-based models attempt to extend the context length to millions of tokens, the quadratic cost of attention makes them hard to deploy in practice. Linear and sparse attention mechanisms can reduce this cost, but they usually require retraining from scratch and perform significantly worse, which limits their real-world use. In response, the paper proposes Hierarchically Pruned Attention (HiP), which reduces the time complexity of attention to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. HiP exploits the locality of attention scores in pretrained LLMs: tokens that are close together tend to have similar attention scores. Based on this observation, HiP uses a tree-search-like algorithm to estimate, on the fly, the top-$k$ key tokens for a given query; the abstract states that this estimate is mathematically guaranteed to perform better than random attention pruning. To further reduce GPU memory usage, HiP also implements KV cache offloading, storing only $O(\log T)$ tokens on the GPU while maintaining similar decoding throughput. Experiments show that HiP, which requires no training, significantly reduces prefill and decoding latency as well as memory usage, while maintaining high-quality generation with minimal degradation. This enables pretrained LLMs to scale to millions of tokens on commodity GPUs, potentially unlocking long-context applications previously considered infeasible.
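
The idea of a tree-search-like top-$k$ estimate can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions, not the authors' implementation: the function name `hierarchical_topk`, the choice of the middle key as each range's representative, and the binary splitting rule are all hypothetical choices made for illustration.

```python
import numpy as np


def hierarchical_topk(q, K, k):
    """Estimate the indices of the k highest-scoring keys for one query.

    Assumption (illustrative, not the paper's exact procedure): every
    candidate is a contiguous key range [start, end), scored by the key
    at its centre. Attention locality suggests neighbours score similarly.
    """
    T = K.shape[0]
    # Start from up to k roughly equal ranges that cover all T keys.
    bounds = np.linspace(0, T, num=min(k, T) + 1).astype(int)
    ranges = [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

    while any(e - s > 1 for s, e in ranges):
        # Split every range that holds more than one key into two halves.
        halves = []
        for s, e in ranges:
            if e - s > 1:
                m = (s + e) // 2
                halves += [(s, m), (m, e)]
            else:
                halves.append((s, e))
        # Score each half by its centre key and keep the k best halves.
        reps = np.array([(s + e) // 2 for s, e in halves])
        scores = K[reps] @ q
        keep = np.argsort(-scores)[:k]
        ranges = [halves[i] for i in keep]

    # Every surviving range now holds exactly one key index.
    return np.array(sorted(s for s, _ in ranges))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, k = 4096, 64, 32
    # Smooth the keys along the sequence so nearby keys resemble each other.
    K = np.cumsum(rng.standard_normal((T, d)), axis=0) / np.sqrt(T)
    q = rng.standard_normal(d)
    estimate = hierarchical_topk(q, K, k)
    exact = np.argsort(-(K @ q))[:k]
    print("overlap with exact top-k:", len(set(estimate) & set(exact)), "of", k)
```

Each iteration scores at most $2k$ representatives and roughly halves the range length, so one query costs about $O(k \log T)$ score evaluations instead of $O(T)$. How well the estimate matches the exact top-$k$ depends on how strongly the scores exhibit locality.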

A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention
