Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

Himel Ghosh
2024-11-24
Abstract: This review report discusses cold-start latency in serverless inference and existing solutions. It focuses on ServerlessLLM, a system designed to address the cold-start problem in serverless inference for large language models. Traditional serverless approaches struggle with high latency due to the size of LLM checkpoints and the overhead of initializing GPU resources. ServerlessLLM introduces a multi-tier checkpoint loading system that leverages underutilized GPU memory and storage to reduce startup times by 6-8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler to allocate resources efficiently and minimize delays. Together, these mechanisms improve the performance and scalability of LLM workloads in serverless environments. Besides ServerlessLLM, several other methods from recent research literature, including Rainbowcake, are reviewed in this paper. Further discussion covers how FaaS providers tackle cold starts and possible directions for future work.
Distributed, Parallel, and Cluster Computing; Machine Learning
### What problem does this paper attempt to address?

This paper addresses the cold-start latency encountered when deploying large language models (LLMs) in a serverless environment. It discusses the following key points:

1. **Cold-start latency**:
   - When a serverless function starts from an idle state, it incurs significant latency because it must load the LLM checkpoint and initialize GPU resources.
   - This latency is particularly harmful to real-time applications such as chatbots and translation tools, which rely on fast response times.
2. **Shortcomings of existing solutions**:
   - Traditional serverless methods suffer from high latency with LLMs, mainly because of the size of the model checkpoints (which can reach hundreds of gigabytes) and the overhead of initializing GPU resources.
   - Existing mitigations, such as over-subscribing GPUs, caching model checkpoints, and deploying additional storage servers, help to some extent but remain limited for LLMs: they increase cost and run into insufficient memory.
3. **Optimizing resource utilization**:
   - The paper reviews a new system, ServerlessLLM, which reduces cold-start latency through multi-tier checkpoint loading, live inference migration, and a startup-time-optimized model scheduler.
   - ServerlessLLM exploits underutilized GPU memory and storage resources, which improves performance and scalability.

### Main solutions

- **Multi-tier checkpoint loading system**: Accelerates the loading of large models through a hierarchical storage structure (GPU memory, DRAM, SSD), which reduces loading bottlenecks and speeds up initialization.
- **Live inference migration**: Allows running inference tasks to migrate between servers, avoiding the latency caused by uneven resource allocation.
- **Startup-time-optimized model scheduler**: Schedules each model according to its estimated loading time and migration time so that it starts on the server with the shortest startup time.

Through these mechanisms, ServerlessLLM reduces cold-start latency and improves LLM inference performance and resource utilization in a serverless environment.
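
The scheduling idea summarized above can be illustrated with a minimal Python sketch. It is not ServerlessLLM's actual implementation: the `Server` class, the tier names, the bandwidth figures, and the rule that startup time equals loading time plus migration time are all assumptions made for illustration. The sketch estimates, for each server, how long a checkpoint would take to load from the fastest tier that holds it, adds the time needed to migrate a running inference away when the GPU is busy, and picks the server with the smallest total.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-tier read bandwidths in GB/s (illustrative values, not measurements).
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    """Hypothetical view of one GPU server as seen by the scheduler."""
    name: str
    checkpoint_tier: str            # fastest tier holding the checkpoint
    gpu_free: bool                  # True if a GPU is idle right now
    migration_seconds: float = 0.0  # estimated time to move a running inference away


def estimate_startup_seconds(server: Server, checkpoint_gb: float) -> float:
    """Estimated startup time: checkpoint load time plus any migration wait."""
    load = checkpoint_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    wait = 0.0 if server.gpu_free else server.migration_seconds
    return load + wait


def pick_server(servers: List[Server], checkpoint_gb: float) -> Server:
    """Choose the server with the smallest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_seconds(s, checkpoint_gb))


servers = [
    Server("a", "dram", gpu_free=False, migration_seconds=10.0),
    Server("b", "ssd", gpu_free=True),
    Server("c", "remote", gpu_free=True),
]

print(pick_server(servers, 40.0).name)   # b: 8 s from SSD beats 2 s + 10 s migration
print(pick_server(servers, 100.0).name)  # a: 5 s + 10 s migration beats 20 s from SSD
```

Under these assumed numbers, a small checkpoint is best served by the idle server reading from SSD, while a larger one justifies migrating the running inference so the model can load from DRAM. This is the kind of trade-off a startup-time-aware scheduler has to weigh.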