Abstract: This review report discusses cold-start latency in serverless inference and existing solutions. It focuses on ServerlessLLM, a system designed to address the cold-start problem in serverless inference for large language models (LLMs). Traditional serverless approaches suffer from high latency because of the size of LLM checkpoints and the overhead of initializing GPU resources. ServerlessLLM introduces a multi-tier checkpoint loading system that leverages underutilized GPU memory and storage to reduce startup times by 6-8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler to allocate resources efficiently and minimize delays. This system significantly improves performance and scalability in serverless environments for LLM workloads. Besides ServerlessLLM, several other methods from the recent research literature, including RainbowCake, are reviewed. Further discussion covers how FaaS providers tackle cold starts and possible future directions.

What problem does this paper attempt to address?

This paper aims to solve the cold-start latency problem encountered when deploying large language models (LLMs) in a serverless environment. Specifically, it discusses the following key challenges:

1. **Cold-start latency**:
   - When a serverless function starts from an idle state, significant latency occurs because the LLM checkpoint must be loaded and GPU resources must be initialized.
   - This latency is particularly detrimental to real-time applications (such as chatbots and translation tools), which rely on fast response times.
2. **Insufficiencies of existing solutions**:
   - Traditional serverless methods suffer from high latency when serving LLMs, mainly because of the size of the model checkpoints (which can reach hundreds of gigabytes) and the overhead of initializing GPU resources.
   - Existing mitigation measures (such as over-subscribing GPUs, caching model checkpoints, and deploying additional storage servers) help to some extent, but remain limited for LLMs because they increase cost and run into insufficient memory resources.
3. **Optimizing resource utilization**:
   - The paper proposes a new system, ServerlessLLM, which aims to reduce cold-start latency through a multi-tier checkpoint loading mechanism, live inference migration, and a startup-time-optimized model scheduler.
   - ServerlessLLM makes full use of underutilized GPU memory and storage resources, thereby significantly improving performance and scalability.

### Main solutions

- **Multi-tier checkpoint loading system**: Accelerates the loading of large models through a hierarchical storage structure (GPU memory, DRAM, SSD), reducing loading bottlenecks and improving initialization speed.
- **Live inference migration**: Allows running inference tasks to migrate seamlessly between servers, avoiding latency caused by uneven resource allocation.
- **Startup-time-optimized model scheduler**: Schedules each model according to its estimated loading time and migration time, so that it is placed where its startup time is shortest.

Through these innovations, ServerlessLLM significantly reduces cold-start latency and improves LLM inference performance and resource utilization in a serverless environment.
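The scheduling idea described above can be illustrated with a minimal Python sketch. It is not ServerlessLLM's actual implementation: the `Server` class, the function names, the `"remote"` tier, the bandwidth figures, and the simple cost model (loading time plus migration time when the GPU is busy) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical read bandwidths per storage tier in GB/s.
# Illustrative values only, not measurements from the paper.
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    cached_tier: str          # fastest tier holding the checkpoint
    gpu_free: bool            # True if a GPU is idle right now
    migration_seconds: float  # assumed cost of migrating the running inference away


def estimate_load_seconds(checkpoint_gb: float, tier: str) -> float:
    """Estimated time to bring a checkpoint from the given tier into GPU memory."""
    return checkpoint_gb / TIER_BANDWIDTH_GBPS[tier]


def estimate_startup_seconds(server: Server, checkpoint_gb: float) -> float:
    """Loading time, plus migration time when the GPU must first be freed."""
    load = estimate_load_seconds(checkpoint_gb, server.cached_tier)
    return load if server.gpu_free else load + server.migration_seconds


def pick_server(servers: Sequence[Server], checkpoint_gb: float) -> Server:
    """Choose the server with the smallest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_seconds(s, checkpoint_gb))


servers = [
    Server("a", cached_tier="remote", gpu_free=True, migration_seconds=0.0),
    Server("b", cached_tier="ssd", gpu_free=True, migration_seconds=0.0),
    Server("c", cached_tier="dram", gpu_free=False, migration_seconds=2.0),
]
best = pick_server(servers, checkpoint_gb=40.0)
print(best.name, estimate_startup_seconds(best, 40.0))  # prints: c 4.0
```

In this toy setup, server `c` wins even though its GPU is busy: migrating the running inference (2 s) and loading from DRAM (2 s) is estimated to be faster than loading the same checkpoint from SSD (8 s) or remote storage (40 s) on an idle server. This is the kind of trade-off a scheduler that weighs both loading time and migration time is meant to capture.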

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Design and implementation of efficient distributed deep learning model inference architecture on serverless computation

A Review: Cold Start Latency in Serverless Computing

Cold Start Latency in Serverless Computing: A Systematic Review, Taxonomy, and Future Directions

LLMaaS: Serving Large Language Models on Trusted Serverless Computing Platforms

Mitigating Cold Starts in Serverless Platforms: A Pool-Based Approach

On-demand Cold Start Frequency Reduction with Off-Policy Reinforcement Learning in Serverless Computing

FaaSLight: General Application-Level Cold-Start Latency Optimization for Function-as-a-Service in Serverless Computing

MLLess: Achieving Cost Efficiency in Serverless Machine Learning Training

Making Serverless Not So Cold in Edge Clouds: A Cost-Effective Online Approach

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication

Serverless Cold Start Performance Optimization Based on Multi-Request Processing and Adaptive Hierarchical Scaling

A Survey of Serverless Machine Learning Model Inference

LaSS: Running Latency Sensitive Serverless Computations at the Edge

Fast Distributed Inference Serving for Large Language Models

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Performance Evaluation of Snapshot Methods to Warm the Serverless Cold Start

Managing Cold-start in The Serverless Cloud with Temporal Convolutional Networks

Efficient Deployment of Large Language Model Across Cloud-Device Systems