Abstract: Emerging Large Language Model (LLM) applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long-context applications, the length of the input prompt poses a significant challenge for inference efficiency, since inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, which provides the opportunity to perform offline optimizations so that user inputs can be processed quickly as they are received. In this work, we propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which of the keys from the fixed context are semantically relevant and need to be loaded. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the context length. We implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4x speedups during both the prefill and generation phases for long-context inference. Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3x reduction in KV cache budget without accuracy loss and up to an 8x reduction with an accuracy gap of less than 0.5 points for various models.
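The two-stage idea in the abstract (offline clustering of the fixed-context keys, then an online centroid lookup followed by exact attention over the selected keys) can be illustrated with a minimal single-head NumPy sketch. This is not the paper's implementation: the function names `kmeans` and `squeezed_attention`, the Euclidean K-means, and the top-`n_keep` cluster selection rule are all assumptions made for illustration, and the hierarchical lookup and the Triton kernels are not covered.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def kmeans(keys, n_clusters, n_iters=20, seed=0):
    """Offline step: plain Euclidean K-means over fixed-context keys (an assumption)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(keys), size=n_clusters, replace=False)
    centroids = keys[init].copy()
    for _ in range(n_iters):
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[labels == c]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Final assignment so that labels match the returned centroids.
    dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(axis=1)

def squeezed_attention(q, keys, values, centroids, labels, n_keep):
    """Online step: score centroids, keep the top n_keep clusters (hypothetical
    rule), then run exact softmax attention over the keys in those clusters."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    counts = np.bincount(labels, minlength=len(centroids))
    # Empty clusters get -inf so they are ranked last.
    logits = np.where(counts > 0, centroids @ q * scale, -np.inf)
    top_clusters = np.argsort(logits)[-n_keep:]
    mask = np.isin(labels, top_clusters)  # which fixed-context keys to load
    weights = softmax(keys[mask] @ q * scale)
    return weights @ values[mask], mask
```

With `n_keep` equal to the number of clusters, every key is selected and the result matches dense attention; smaller values trade accuracy for fewer keys loaded per query.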

FocusLLM: Precise Understanding of Long Context by Dynamic Condensing

Reducing Distraction in Long-Context Language Models by Focused Learning

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Focused Transformer: Contrastive Training for Context Scaling

Extending Context Window of Large Language Models via Semantic Compression

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern

Squeezed Attention: Accelerating Long Context Length LLM Inference

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Empower Your Model with Longer and Better Context Comprehension

In-Context Former: Lightning-fast Compressing Context for Large Language Model

Focused Large Language Models are Stable Many-Shot Learners

A Controlled Study on Long Context Extension and Generalization in LLMs

LongHeads: Multi-Head Attention is Secretly a Long Context Processor

LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models

LLoCO: Learning Long Contexts Offline

Make Your LLM Fully Utilize the Context

Training-Free Long-Context Scaling of Large Language Models