Abstract: The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-augmented generation (RAG), makes contextual states (i.e., KV cache) reusable across user requests. Given the capacity constraints of GPU memory, only a limited number of contexts can be cached on GPU for reuse. Existing inference systems typically evict part of the KV cache and restore it by recomputing it from the original tokens or offloading it to host storage for later retrieval, both of which introduce substantial computational or I/O overheads. We propose HCache, a novel LLM state restoration method. Its key idea is to restore LLM states from intermediate activations and thus utilize computational and I/O resources with low overhead. We enhance HCache with two techniques, including i) a bubble-free restoration scheduler that integrates resource-complementary methods to optimize the balance between computation and I/O tasks; and ii) a chunk-based storage manager to address the layout mismatch issue (i.e., layer-before-token saving versus token-before-layer restoration). Our evaluations, conducted using real-world tasks, show that HCache reduces the TTFT by up to 1.93X compared to KV offload while consuming 1.92-2.40X less storage space; compared to token recomputation, HCache achieves up to 5.73X reduction in TTFT.
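
The key idea in the abstract, restoring KV cache from intermediate activations, can be illustrated with a small sketch. This is not the paper's implementation: it assumes a single toy attention layer whose keys and values are plain full-width linear projections of the layer's input activations, and it leaves out details a real model would have, such as normalization and positional encoding. The names `restore_kv`, `matmul`, and `rand_matrix`, and all sizes, are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols, rng):
    """Build a rows x cols matrix of Gaussian samples (toy data only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def restore_kv(hidden, w_k, w_v):
    """Rebuild one layer's K and V from that layer's saved input activations."""
    return matmul(hidden, w_k), matmul(hidden, w_v)

rng = random.Random(0)
n_tokens, d_model = 4, 8  # toy sizes, chosen only for illustration

# Assumption: the activations are what gets saved to host storage,
# while the projection weights already live on the GPU with the model.
hidden = rand_matrix(n_tokens, d_model, rng)
w_k = rand_matrix(d_model, d_model, rng)
w_v = rand_matrix(d_model, d_model, rng)

k, v = restore_kv(hidden, w_k, w_v)

# Compare how many values must be stored and transferred in each case.
hidden_elems = n_tokens * d_model
kv_elems = len(k) * len(k[0]) + len(v) * len(v[0])
print(kv_elems / hidden_elems)  # 2.0 in this toy setting
```

In this toy setting the saved activations hold half as many values as the K and V tensors they regenerate, and restoring them costs two matrix multiplications rather than a full forward pass over the tokens. The ratios in a real model depend on its architecture, so the figure printed here should not be read as the paper's measurement.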

What problem does this paper attempt to address?

This paper addresses a problem in large language model (LLM) serving: because GPU memory capacity is limited, the historical context states (i.e., KV caches) cannot all be kept on the GPU. When a new user request reuses a context whose state has been evicted, that state must be restored first, which incurs significant computational or I/O overhead.

Existing inference systems usually restore states in one of two ways. The first is recomputation, which regenerates the KV cache from the original tokens. The second is KV offload, which saves the KV cache to host storage and reloads it into GPU memory when needed. Both degrade performance: the former is computationally expensive, and the latter requires a large I/O transfer volume.

To overcome these problems, the paper proposes HCache, a new LLM state-restoration method. The core idea of HCache is to restore LLM states from intermediate activations (i.e., hidden states), so that state restoration uses both computational and I/O resources and incurs lower overhead. The paper enhances HCache with two key techniques:

1. **Bubble-Free Restoration Scheduler**: This scheduler combines resource-complementary methods to balance computational and I/O tasks, eliminating bubbles in the pipeline and improving restoration speed.
2. **Chunk-Based Storage Manager**: This manager resolves the storage-layout mismatch, that is, the inconsistency between saving layers before tokens and restoring tokens before layers, thereby optimizing the storage format and reducing state-restoration time.

With these techniques, HCache reduces the time-to-first-token (TTFT) by up to 1.93X compared with KV offload while consuming 1.92-2.40X less storage space, and by up to 5.73X compared with token recomputation.
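
To make the balancing idea behind the bubble-free scheduler more concrete, here is a minimal sketch under strong assumptions. It is not the paper's scheduler. It assumes that I/O and computation overlap perfectly, so the restoration time is the larger of the two totals, and that each layer is restored either from hidden states or by one complementary method with a different resource profile. The function `plan_restoration` and every cost value are hypothetical and in arbitrary units.

```python
def plan_restoration(n_layers, hid_io, hid_comp, alt_io, alt_comp):
    """Choose how many layers to hand to the complementary method.

    hid_io / hid_comp: per-layer I/O and compute cost of hidden-state restoration.
    alt_io / alt_comp: per-layer costs of the complementary method.
    Returns (layers_on_complementary_method, estimated_time), where the
    estimate is max(total I/O, total compute) under full overlap.
    """
    best = None
    for n_alt in range(n_layers + 1):
        n_hid = n_layers - n_alt
        total_io = n_hid * hid_io + n_alt * alt_io
        total_comp = n_hid * hid_comp + n_alt * alt_comp
        estimate = max(total_io, total_comp)
        if best is None or estimate < best[1]:
            best = (n_alt, estimate)
    return best

# Made-up example: hidden-state restoration is I/O-bound (1.0 I/O vs 0.25 compute
# per layer), so it is paired with a compute-only method such as recomputation.
n_alt, estimate = plan_restoration(32, hid_io=1.0, hid_comp=0.25,
                                   alt_io=0.0, alt_comp=2.0)
print(n_alt, estimate)  # 9 23.75
```

With these invented costs, restoring all 32 layers from hidden states would take 32 units because the compute stream idles while waiting on I/O. Moving 9 layers to the compute-only method brings the two streams close to balance and lowers the estimate to 23.75 units. A compute-bound configuration would instead call for an I/O-only complement such as KV offload.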

Fast State Restoration in LLM Serving with HCache

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

Efficient LLM Inference with Kcache

In-context KV-Cache Eviction for LLMs via Attention-Gate

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache

Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Lossless KV Cache Compression to 2%

CORM: Cache Optimization with Recent Message for Large Language Model Inference

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

LLM-dCache: Improving Tool-Augmented LLMs with GPT-Driven Localized Data Caching

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy