Abstract: Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls. In interactive LLM applications, efficient scheduling is crucial for maintaining low request completion times, directly impacting user engagement. However, these augmentations introduce scheduling challenges due to the need to manage limited memory for cached information (KV caches). As a result, traditional size-based scheduling algorithms, such as Shortest Job First (SJF), become less effective at minimizing completion times. Existing work focuses only on handling requests during API calls by preserving, discarding, or swapping memory without considering how to schedule requests with API calls. In this paper, we propose LAMPS, a novel LLM inference framework for augmented LLMs. LAMPS minimizes request completion time through a unified scheduling approach that considers the total length of requests and their handling strategies during API calls. Recognizing that LLM inference is memory-bound, our approach ranks requests based on their consumption of memory over time, which depends on both the output sizes and how a request is managed during its API calls. To implement our scheduling, LAMPS predicts the strategy that minimizes memory waste of a request during its API calls, aligning with but improving upon existing approaches. We also propose starvation prevention techniques and optimizations to mitigate the overhead of our scheduling. We implement LAMPS on top of vLLM and evaluate its performance against baseline LLM inference systems, demonstrating improvements in end-to-end latency by 27%-85% and reductions in TTFT by 4%-96% compared to the existing augmented-LLM system, with even greater gains over vLLM.

What problem does this paper attempt to address?

This paper addresses how to minimize request completion time in Augmented Large Language Models (LLMs) by optimizing the scheduling strategy. Specifically, it focuses on managing limited memory resources (especially KV caches) when handling requests that make API calls, so as to avoid high latency and low throughput.

### Problem Background

Augmented LLMs expand their capabilities by integrating external data sources through API calls, which lets them handle more complex tasks. These augmentations bring new challenges in memory management and scheduling. Traditional size-based scheduling algorithms, such as Shortest Job First (SJF), become less effective in this setting because they do not account for the time and memory consumed during API calls.

### Main Challenges

1. **Memory Management**: Each request builds up a key-value cache (KV cache) during decoding, and these caches occupy a large amount of memory. During an API call, the system must decide how to handle the cache (retain it, discard and recompute it, or swap it to CPU memory), which directly affects memory efficiency.
2. **Scheduling Strategy**: Existing scheduling strategies usually consider only the length of the request itself and ignore the duration of API calls. This leads to head-of-line blocking: long-running requests (including requests waiting for API responses) delay the processing of shorter requests.

### Solutions

To solve these problems, the paper proposes LAMPS (LLM API- and Memory-based Predictive Scheduling), a novel inference framework that aims to minimize request completion time through a unified scheduling method. The main features of LAMPS include:

1. **Predicting API Handling Strategies**: LAMPS predicts the duration and output length of API calls from the input prompts of requests, and accordingly selects the memory handling strategy (retain, discard and recompute, or swap) that minimizes memory waste during the call.
2. **Memory-Consumption-Based Scheduling**: LAMPS considers not only the total length of a request but also its memory handling strategy during API calls. It ranks requests by their predicted memory consumption over time and schedules them in that order.
3. **Preventing Starvation**: LAMPS adds starvation prevention techniques, along with optimizations that reduce scheduling overhead, so that all requests are processed in a timely manner.

### Experimental Results

The experiments show that LAMPS improves end-to-end latency by 27%-85% and reduces Time To First Token (TTFT) by 4%-96% compared with the existing augmented-LLM system (INFERCEPT), with even greater gains over vLLM. This indicates that LAMPS has significant advantages in handling API-augmented requests.

In conclusion, the LAMPS framework addresses the memory management and scheduling challenges that augmented LLMs face when handling requests with API calls, and it substantially reduces latency in the reported evaluation.
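The strategy selection and memory-based ranking summarized above can be illustrated with a small sketch. This is not the LAMPS implementation: the `Request` fields, the timing constants, and the three waste formulas are simplified assumptions chosen for illustration, and the predicted values are taken as given rather than produced by a predictor. The sketch picks, for each request, the handling strategy with the lowest memory waste during its API call, then ranks requests by an estimate of memory held over time (in token-seconds) and contrasts that with a plain output-length (SJF-style) order.

```python
from dataclasses import dataclass

# All constants are illustrative assumptions, not values from the paper.
TOKEN_TIME = 0.02            # seconds to decode one token
RECOMPUTE_PER_TOKEN = 0.001  # seconds to recompute one token of KV cache
SWAP_FIXED = 0.05            # fixed seconds per swap direction
SWAP_PER_TOKEN = 0.0002      # seconds per token per swap direction
STRATEGIES = ("preserve", "discard", "swap")


@dataclass
class Request:
    """Hypothetical request description; all values are assumed predictions."""
    name: str
    prompt_tokens: int
    pre_api_tokens: int    # tokens generated before the API call
    api_duration: float    # predicted API call duration in seconds
    post_api_tokens: int   # tokens generated after the API returns

    @property
    def context_at_api(self) -> int:
        return self.prompt_tokens + self.pre_api_tokens


def memory_waste(req: Request, strategy: str) -> float:
    """Token-seconds of KV-cache memory held without useful decoding."""
    c = req.context_at_api
    if strategy == "preserve":   # cache sits idle for the whole call
        return c * req.api_duration
    if strategy == "discard":    # cache is rebuilt when the call returns
        return c * (c * RECOMPUTE_PER_TOKEN)
    if strategy == "swap":       # cache is moved out and back in
        return c * 2 * (SWAP_FIXED + c * SWAP_PER_TOKEN)
    raise ValueError(f"unknown strategy: {strategy}")


def best_strategy(req: Request) -> str:
    return min(STRATEGIES, key=lambda s: memory_waste(req, s))


def decode_area(start_context: int, new_tokens: int) -> float:
    """Memory-time of decoding new_tokens, with the cache growing by one per step."""
    return TOKEN_TIME * (new_tokens * start_context + new_tokens * (new_tokens + 1) / 2)


def memory_time(req: Request) -> float:
    before = decode_area(req.prompt_tokens, req.pre_api_tokens)
    during = memory_waste(req, best_strategy(req))
    after = decode_area(req.context_at_api, req.post_api_tokens)
    return before + during + after


def rank_by_memory_time(requests):
    return sorted(requests, key=memory_time)


def rank_by_output_length(requests):
    return sorted(requests, key=lambda r: r.pre_api_tokens + r.post_api_tokens)


r1 = Request("short_output_long_api", 900, 100, 5.0, 20)
r2 = Request("long_output_short_api", 50, 50, 0.05, 200)
print([r.name for r in rank_by_output_length([r1, r2])])
print([r.name for r in rank_by_memory_time([r1, r2])])
```

Under these assumed constants, the two orderings disagree: the request with the shorter output holds a large cache through a long API call, so a memory-over-time ranking places it after the request with the longer output. Starvation prevention (for example, promoting requests that have waited too long) is omitted from the sketch.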

Fast Inference for Augmented Large Language Models

Efficient LLM Scheduling by Learning to Rank

Fast Distributed Inference Serving for Large Language Models

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Efficient Memory Management for Large Language Model Serving with PagedAttention

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Don't Stop Me Now: Embedding Based Scheduling for LLMs

Llumnix: Dynamic Scheduling for Large Language Model Serving

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

InferCept: Efficient Intercept Support for Augmented Large Language Model Inference

Fairness in Serving Large Language Models

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

LiveMind: Low-latency Large Language Models with Simultaneous Inference

Open-AI model Efficient Memory Reduce Management for the Large Language Models (LLMs) Serving with Paged Attention of sharing the KV Cashes

UELLM: A Unified and Efficient Approach for LLM Inference Serving