Abstract: Emerging Large Language Model (LLM) applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long-context applications, the length of the input prompt poses a significant challenge for inference efficiency, since inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, which provides an opportunity to perform offline optimizations so that user inputs can be processed quickly as they are received. In this work, we propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We first use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant and need to be loaded. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the context length. We implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4x speedups during both the prefill and generation phases for long-context inference. Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3x reduction in KV cache budget without accuracy loss and up to an 8x reduction with a <0.5 point accuracy gap for various models.
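
The pipeline described in the abstract (offline clustering of fixed-context keys, online centroid comparison, then exact attention over the selected keys) can be illustrated with a small single-head NumPy sketch. This is not the authors' implementation: the function names `kmeans`, `softmax`, and `squeezed_attention` are hypothetical, plain Euclidean K-means stands in for whatever similarity measure the method uses, and a fixed `top_clusters` count stands in for the actual key-selection rule. The hierarchical lookup and the Triton kernels are not shown.

```python
import numpy as np

def kmeans(keys, n_clusters, n_iters=10, seed=0):
    """Offline step (assumed: plain Euclidean Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct randomly chosen keys.
    centroids = keys[rng.choice(len(keys), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[labels == c]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Recompute labels so they agree with the final centroids.
    dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(axis=1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def squeezed_attention(q, keys, values, centroids, labels, top_clusters):
    """Online step: score centroids, keep the best clusters, attend exactly."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    present = np.unique(labels)                 # ignore clusters with no keys
    centroid_scores = centroids[present] @ q * scale
    keep = present[np.argsort(centroid_scores)[-top_clusters:]]
    mask = np.isin(labels, keep)                # keys predicted to be important
    weights = softmax(keys[mask] @ q * scale)   # exact attention on the subset
    return weights @ values[mask], mask
```

In this sketch, `kmeans` would run once for the fixed context, while `squeezed_attention` would run per query; keeping every cluster reproduces dense attention exactly, and keeping fewer trades accuracy for less key/value traffic.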

SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion

Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Extending Context Window of Large Language Models via Semantic Compression

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

Beyond KV Caching: Shared Attention for Efficient LLMs

Context Compression for Auto-regressive Transformers with Sentinel Tokens

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

HSR-Enhanced Sparse Attention Acceleration

Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Context Compression and Extraction: Efficiency Inference of Large Language Models

Recycled Attention: Efficient inference for long-context language models

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Squeezed Attention: Accelerating Long Context Length LLM Inference

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios

Training-Free Exponential Context Extension via Cascading KV Cache

In-context Autoencoder for Context Compression in a Large Language Model