Abstract: Serving numerous users and requests concurrently requires good fairness in Large Language Model (LLM) serving systems. This ensures that, at the same cost, the system can meet the Service Level Objectives (SLOs) of more users, such as time to first token (TTFT) and time between tokens (TBT), rather than allowing a few users to experience performance far exceeding the SLOs. To achieve better fairness, the preemption-based scheduling policy dynamically adjusts the priority of each request to maintain balance during runtime. However, existing systems tend to overly prioritize throughput, overlooking the overhead caused by preemption-induced context switching, which is crucial for maintaining fairness through priority adjustments. In this work, we identify three main challenges that result in this overhead: 1) inadequate I/O utilization, 2) GPU idleness, and 3) unnecessary I/O transmission during multi-turn conversations. Our key insight is that the block-based KV cache memory policy in existing systems, while achieving near-zero memory waste, leads to discontinuity and insufficient granularity in the KV cache memory. To respond, we introduce FastSwitch, a fairness-aware serving system that not only aligns with the existing KV cache memory allocation policy but also mitigates context switching overhead. Our evaluation shows that FastSwitch outperforms the state-of-the-art LLM serving system vLLM with speedups of 1.4-11.2x across different tail TTFT and TBT.

What problem does this paper attempt to address?

The paper addresses how to make context switching efficient in large language model (LLM) serving systems in order to improve fairness and service quality. Specifically, it aims to reduce the context-switching overhead caused by preemptive scheduling while still processing requests from multiple users fairly, thereby improving the overall performance and service quality of the system.

### Specific Problems and Challenges

1. **Insufficient I/O Utilization**:
   - Although the KV cache management strategy in existing systems reduces memory fragmentation, it leads to non-contiguous memory allocation, which lowers the utilization of PCIe I/O bandwidth.
   - With fine-grained KV cache swapping (e.g., 128KB), the dispatch overhead of each `cudaMemcpyAsync` call exceeds the actual transfer time, leaving the I/O link idle.
2. **GPU Idleness**:
   - Preemptive scheduling can leave the GPU idle. Especially when priorities are updated frequently, prefetching and swapping KV caches stall the GPU.
   - Hierarchical asynchronous swapping (as in AttentionStore) interferes with CUDA graph execution and increases inference time, especially when the swapping latency exceeds the inference time.
3. **Redundant I/O Transmission in Multi-turn Conversations**:
   - In multi-turn conversations, repeated KV cache swapping generates unnecessary I/O transmission and wastes bandwidth.
   - Because CPU memory is limited, high-priority requests may invalidate the KV cache backups of low-priority requests, which increases unnecessary context removal.

### Solutions

To address these problems, the paper proposes a new serving system, FastSwitch, which improves the efficiency of preemptive context switching through three key optimizations:

1. **Dynamic Block Group Manager**:
   - Manages larger block groups instead of individual blocks, which reduces dispatch time and improves I/O bandwidth utilization.
   - Dynamically adjusts the block group size to match the requirements of each request and to optimize transfer efficiency.
2. **Multithreading Swap Manager**:
   - Introduces an adaptive swapping strategy that dynamically selects synchronous or asynchronous swapping according to the system state, balancing swapping overhead against token generation efficiency.
   - Implements API dispatch in C++, avoiding Python GIL limitations and ensuring a conflict-free dispatch order for multi-stream CUDA runtime APIs.
3. **KV Cache Reuse Mechanism**:
   - Reuses previously stored KV caches in multi-turn conversations, reducing unnecessary I/O transmission and improving resource utilization.

### Summary

By optimizing KV cache management and the context-switching process, FastSwitch significantly improves the fairness and responsiveness of LLM serving systems while maintaining efficient resource utilization. Experimental results show that FastSwitch is 1.4 to 11.2 times faster than the existing state-of-the-art system vLLM across different tail TTFT and TBT, and that throughput increases by up to 1.44 times.
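
To make the I/O utilization argument concrete, the following is a minimal Python sketch, not FastSwitch's actual implementation. It merges the KV cache block IDs of a preempted request into contiguous runs, so that each run could be moved with one copy call instead of one call per block, and it compares a rough cost estimate for both cases. The function names `coalesce_blocks` and `estimate_swap_time_us` are hypothetical, and the 10 µs per-call dispatch overhead and 25 GB/s effective PCIe bandwidth are assumed illustrative values rather than figures from the paper. Only the 128KB block size comes from the text above. The sketch also assumes that the host-side buffer for each run is contiguous.

```python
from typing import List, Tuple

BLOCK_BYTES = 128 * 1024  # swap granularity mentioned in the summary (128KB)


def coalesce_blocks(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge block IDs into (start, length) runs of contiguous blocks."""
    runs: List[Tuple[int, int]] = []
    for block in sorted(set(block_ids)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # The block directly follows the last run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # There is a gap, so start a new run.
            runs.append((block, 1))
    return runs


def estimate_swap_time_us(
    runs: List[Tuple[int, int]],
    block_bytes: int = BLOCK_BYTES,
    launch_overhead_us: float = 10.0,  # assumed per-call dispatch cost
    bandwidth_gb_s: float = 25.0,      # assumed effective PCIe bandwidth
) -> float:
    """Rough cost model: one copy call per run plus the raw transfer time."""
    total_bytes = sum(length for _, length in runs) * block_bytes
    transfer_us = total_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return len(runs) * launch_overhead_us + transfer_us


if __name__ == "__main__":
    blocks = [7, 3, 4, 5, 9, 8]
    per_block = [(b, 1) for b in blocks]  # one copy call per block
    grouped = coalesce_blocks(blocks)     # one copy call per contiguous run
    print("runs:", grouped)
    print("per-block estimate (us):", estimate_swap_time_us(per_block))
    print("grouped estimate (us):", estimate_swap_time_us(grouped))
```

Under these assumed numbers, the transfer time is identical in both cases and the difference comes entirely from the number of copy calls, which is the effect the block-group approach targets. According to the summary, FastSwitch allocates and resizes block groups up front instead of merging blocks after the fact, so this sketch illustrates only the cost argument.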

FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
