Abstract: Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the challenges of prefix caching in Hybrid LLMs (Large Language Models). Hybrid models combine Attention layers with recurrent layers (such as State Space Models, SSMs) to support long contexts efficiently. However, the unique properties of these models make it difficult to apply traditional prefix caching techniques directly, which results in low caching efficiency.

#### Main problems:

1. **Irreversibility of SSM state updates**: The state of an SSM layer is updated in place, so once a sequence has been processed, its state cannot be rolled back to represent a partial prefix. Only exactly matching cache entries can be reused. The result is a large number of cache entries per sequence, most of which have little chance of being reused.
2. **Inefficient use of cache entries**: Fine-grained checkpointing (for example, creating a checkpoint every 256 tokens) generates many cache entries, each of which is large because of the size of the SSM state. Most of these entries are rarely reused, which causes cache thrashing and a poor trade-off between memory consumed and compute saved.
3. **Deficiencies in existing cache systems**: Existing prefix caching systems are designed mainly for pure Transformer models and cannot handle the SSM states in hybrid models effectively, which leads to low cache hit rates and higher latency.

#### Solutions:

To solve these problems, the authors propose Marconi, a prefix caching system designed specifically for hybrid LLMs. Marconi introduces new cache admission and eviction policies that assess potential cache entries more judiciously, managing them according to their reuse likelihood and the compute savings they deliver.

- **Cache admission policy**: Marconi estimates how likely an SSM state is to be reused by forecasting future prefix-reuse scenarios across a taxonomy of hit types, and admits only SSM states with a high reuse likelihood.
- **Cache eviction policy**: Marconi introduces an eviction policy based on FLOP (floating-point operation) efficiency, which weighs how recently a cache entry was used together with the compute savings a hit would deliver relative to the entry's memory footprint.

With these improvements, Marconi achieves significant gains across diverse workloads and hybrid model architectures: up to 34.4× higher token hit rates and up to 71.1% (617 ms) lower time-to-first-token (TTFT) compared to state-of-the-art prefix caching systems.

Marconi: Prefix Caching for the Era of Hybrid LLMs

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

On Optimal Caching and Model Multiplexing for Large Model Inference

Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models

Efficient LLM Inference with Kcache

Dual Cache for Long Document Neural Coreference Resolution

UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference

NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

CORM: Cache Optimization with Recent Message for Large Language Model Inference

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

LoCoCo: Dropping In Convolutions for Long Context Compression

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference