Abstract: Leveraging serverless computing for cloud-based machine learning services is on the rise, promising the cost-efficiency and flexibility that are crucial for ML applications relying on high-performance GPUs and substantial memory. However, although modern serverless platforms handle diverse devices like GPUs seamlessly on a pay-as-you-go basis, a longstanding challenge remains: startup latency, a well-studied issue in CPU-centric serverless computing. For example, initializing GPU apps with small models, like MobileNet, takes several seconds. For more intricate models such as GPT-2, startup latency can escalate to around 10 seconds, vastly overshadowing the short computation time of GPU-based inference. Prior solutions tailored for CPU serverless setups, like fork() and Checkpoint/Restore, cannot be directly and effectively applied due to differences between CPUs and GPUs. This paper presents gCROP (GPU Checkpoint/Restore made On-demand and Parallel), the first GPU runtime that achieves <100ms startup latency for GPU apps with up to 774 million parameters (3.1GB GPT-2-Large model). The key insight behind gCROP is to selectively restore essential states on demand and in parallel during boot from a prepared checkpoint image. To this end, gCROP first introduces a global service, the GPU Restore Server, which breaks the existing barrier between restore stages and enables parallel restore. In addition, gCROP leverages both CPU and GPU page faults to restore CPU and GPU data on demand, in a profile-guided order that mitigates the cost of those faults. Moreover, gCROP designs a multi-checkpoint mechanism to increase the common contents among checkpoint images and utilizes deduplication to reduce storage costs. An implementation and evaluation on AMD GPUs show significant improvements in startup latency: 6.4x-24.7x compared with booting from scratch and 3.9x-23.5x over the state-of-the-art method (CRIU).
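
The abstract mentions deduplication across checkpoint images but does not say how it is done. As a rough illustration only, the sketch below assumes fixed-size chunks identified by a SHA-256 content hash, so that chunks shared by several images are stored once. The chunk size, the function names (`chunk_digests`, `deduplicate`, `rebuild`), and the toy images are all hypothetical and are not taken from gCROP.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size in bytes (an assumption, not from the paper)


def chunk_digests(image, chunk_size=CHUNK_SIZE):
    """Yield (digest, chunk) pairs for consecutive fixed-size chunks of an image."""
    for offset in range(0, len(image), chunk_size):
        chunk = image[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk


def deduplicate(images, chunk_size=CHUNK_SIZE):
    """Store each distinct chunk once and keep an ordered digest list per image."""
    store = {}      # digest -> chunk bytes (shared by all images)
    manifests = {}  # image name -> ordered list of digests
    for name, image in images.items():
        digests = []
        for digest, chunk in chunk_digests(image, chunk_size):
            store.setdefault(digest, chunk)  # keep only the first copy of a chunk
            digests.append(digest)
        manifests[name] = digests
    return store, manifests


def rebuild(name, store, manifests):
    """Reassemble an image from its manifest and the shared chunk store."""
    return b"".join(store[d] for d in manifests[name])


# Two toy "checkpoint images": four shared chunks plus one unique chunk each.
shared = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
image_a = shared + bytes([200]) * CHUNK_SIZE
image_b = shared + bytes([201]) * CHUNK_SIZE

store, manifests = deduplicate({"a": image_a, "b": image_b})

raw_chunks = sum(len(m) for m in manifests.values())
print(raw_chunks)   # 10 chunks referenced across both images
print(len(store))   # 6 distinct chunks actually stored
assert rebuild("a", store, manifests) == image_a
```

In this toy case the more content two images share, the fewer distinct chunks are stored, which is the intuition behind increasing the common contents among checkpoint images before deduplicating them.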

On-demand and Parallel Checkpoint/Restore for GPU Applications
