Abstract: Deep neural network (DNN) foundation models exhibit high prediction accuracy and strong adaptability to a broad range of tasks, at remarkably large model scales. They are increasingly becoming the backend of DNN-driven real-time online services, e.g., Siri and Instagram. Such services require low latency and cost efficiency for quality of service and commercial competitiveness. When deployed in a cloud environment, these services call for an appropriate selection of cloud configurations (i.e., specific types of VM instances), as well as a carefully designed device placement plan that assigns the model's operations to multiple GPUs via model parallelism for cost efficiency. Currently, deployment relies mainly on service providers' manual efforts, which are not only onerous but also often far from satisfactory because of the huge joint search space of cloud configurations and device placement plans (for the same service, a poor deployment can cost tens of times more). In this paper, we attempt to efficiently automate cloud deployment for real-time foundation model inference, minimizing cost under the constraint of acceptably low latency. This attempt is enabled by 1) jointly leveraging Bayesian Optimization and Deep Reinforcement Learning to adaptively find the (nearly) optimal cloud configuration and device placement within limited search time, and 2) enhancing the cost efficiency of the deployment with a probing-informed block multiplexing mechanism and the Tensor Algebra SuperOptimizer. We implement a prototype system based on TensorFlow, conduct extensive experiments on Microsoft Azure, and demonstrate the generality and scalability of our solution. Results show that, for lightweight DNN models and foundation models, our solution reduces inference costs by up to 15% and 47%, with 57% and 38% lower search overheads, respectively, compared with non-trivial baselines.
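To make the optimization target concrete, the following is a minimal sketch of the joint selection problem described above: choose a (cloud configuration, device placement) pair that minimizes cost subject to a latency bound. It is not the paper's method. Exhaustive enumeration stands in for the Bayesian Optimization and Deep Reinforcement Learning search, and every VM name, price, placement label, and latency value is a hypothetical placeholder rather than a measured result.

```python
from itertools import product

# Hypothetical hourly prices (USD) for illustrative VM types.
# These are placeholders, not real Microsoft Azure prices.
VM_PRICE_PER_HOUR = {"vm_small": 1.0, "vm_medium": 2.5, "vm_large": 6.0}

# Hypothetical device placement plans (labels only, for illustration).
PLACEMENTS = ["single_gpu", "two_gpu_split"]

# Hypothetical inference latency (ms) for each (VM type, placement) pair.
LATENCY_MS = {
    ("vm_small", "single_gpu"): 180.0,
    ("vm_small", "two_gpu_split"): 150.0,
    ("vm_medium", "single_gpu"): 120.0,
    ("vm_medium", "two_gpu_split"): 80.0,
    ("vm_large", "single_gpu"): 70.0,
    ("vm_large", "two_gpu_split"): 45.0,
}


def cheapest_feasible(latency_bound_ms):
    """Return (cost, latency, vm, placement) for the cheapest pair whose
    latency does not exceed the bound, or None if no pair is feasible.
    Ties on cost are broken in favor of lower latency."""
    best = None
    for vm, plan in product(VM_PRICE_PER_HOUR, PLACEMENTS):
        latency = LATENCY_MS[(vm, plan)]
        if latency > latency_bound_ms:
            continue  # violates the latency constraint
        cost = VM_PRICE_PER_HOUR[vm]
        if best is None or (cost, latency) < (best[0], best[1]):
            best = (cost, latency, vm, plan)
    return best


if __name__ == "__main__":
    print(cheapest_feasible(100.0))
```

In a realistic setting the joint space is far too large to enumerate, which is why the abstract motivates a guided search. The sketch only illustrates the objective and the constraint.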

AMPS-Inf: Automatic Model Partitioning for Serverless Inference with Cost Efficiency

Design and implementation of efficient distributed deep learning model inference architecture on serverless computation

Cost-Efficient Serverless Inference Serving with Joint Batching and Multi-Processing

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Costless: Optimizing Cost of Serverless Computing through Function Fusion and Placement

QoS-Aware and Cost-Efficient Dynamic Resource Allocation for Serverless ML Workflows

MLLess: Achieving Cost Efficiency in Serverless Machine Learning Training

Automating Cloud Deployment for Real-Time Online Foundation Model Inference

SeSeMI: Secure Serverless Model Inference on Sensitive Data

A Deep Reinforcement Learning based Algorithm for Time and Cost Optimized Scaling of Serverless Applications

Stateful Serverless Application Placement in MEC with Function and State Dependencies

Taming Serverless Cold Start of Cloud Model Inference With Edge Computing

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Astrea: Auto-Serverless Analytics Towards Cost-Efficiency and QoS-Awareness

A Survey of Serverless Machine Learning Model Inference

Prediction-driven resource provisioning for serverless container runtimes

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication

SMSS: Stateful Model Serving in Metaverse with Serverless Computing and GPU Sharing

Demystifying the Cost of Serverless Computing: Towards a Win-Win Deal

Serverless inferencing on Kubernetes