Abstract: Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling LLM generation inference to such long contexts incurs significant additional computation and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, still face memory bottlenecks as context length increases, while methods that keep a static-sized cache, such as eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing for more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units along with a chunked prefill pattern, significantly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret. The experimental results show that Locret outperforms recent competitive approaches, including InfLLM, Quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated content. Compared to the full KV cache, Locret achieves KV cache compression ratios of over 20x and 8x for Phi-3-mini-128K and Llama-3.1-8B-instruct, respectively. Additionally, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality and requiring little additional system optimization.
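
The eviction-with-chunked-prefill idea described in the abstract can be illustrated with a minimal sketch. This is not the Locret implementation: the function name `chunked_prefill_evict` and its parameters are hypothetical, and the per-token importance scores are passed in as plain numbers that stand in for whatever a trained retaining head would predict. The sketch only shows why the cache never grows beyond the budget plus one chunk.

```python
from typing import List, Tuple


def chunked_prefill_evict(
    scores: List[float], chunk_size: int, budget: int
) -> Tuple[List[int], int]:
    """Toy score-based eviction during chunked prefill.

    scores[i] is an assumed importance score for the cache unit at
    position i (a stand-in for a retaining head's output).
    Returns the retained positions and the peak cache size observed.
    """
    cache: List[Tuple[int, float]] = []  # (position, score) pairs
    peak = 0
    for start in range(0, len(scores), chunk_size):
        chunk = scores[start:start + chunk_size]
        # Prefill one chunk: its cache units join the cache.
        cache.extend((start + i, s) for i, s in enumerate(chunk))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the `budget` highest-scoring units, evict the rest.
            cache = sorted(cache, key=lambda u: u[1], reverse=True)[:budget]
            # Restore positional order for the next chunk.
            cache.sort(key=lambda u: u[0])
    return [pos for pos, _ in cache], peak


kept, peak = chunked_prefill_evict(
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4], chunk_size=2, budget=3
)
print(kept, peak)
```

In this toy setting the peak cache size is bounded by `budget + chunk_size` regardless of the total context length, which is the property that makes a fixed memory footprint possible. Eviction quality then depends entirely on how well the scores reflect the importance of each unit to later tokens.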

RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices

High-throughput Generative Inference of Large Language Models with a Single GPU

Efficient LLM inference solution on Intel GPU

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Inference Performance Optimization for Large Language Models on CPUs

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Distributed Inference Performance Optimization for LLMs on CPUs

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Efficient and Economic Large Language Model Inference with Attention Offloading

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores