Abstract: Deep neural network (DNN) foundation models now achieve high prediction accuracy and adapt well to a broad range of tasks, at remarkably large model scales. They increasingly serve as the backend of DNN-driven real-time online services such as Siri and Instagram. Such services require low latency and cost efficiency to maintain quality of service and commercial competitiveness. When deployed in a cloud environment, these services call for an appropriate selection of cloud configuration (i.e., specific types of VM instances) as well as a careful device placement plan that distributes the model's operations across multiple GPUs via model parallelism for cost efficiency. Currently, deployment relies mainly on service providers' manual effort, which is not only onerous but also often far from satisfactory because of the huge joint search space of cloud configurations and device placement plans (for the same service, a poor deployment can cost tens of times more). In this paper, we aim to efficiently automate cloud deployment for real-time foundation model inference, minimizing cost under the constraint of acceptably low latency. This is enabled by 1) jointly leveraging Bayesian Optimization and Deep Reinforcement Learning to adaptively find the (nearly) optimal cloud configuration and device placement within limited search time, and 2) improving the cost efficiency of the deployment through a probing-informed block multiplexing mechanism and the Tensor Algebra SuperOptimizer. We implement a prototype system based on TensorFlow, conduct extensive experiments on Microsoft Azure, and demonstrate the generality and scalability of our solution. Results show that, compared with non-trivial baselines, our solution reduces inference costs by up to 15% for lightweight DNN models and up to 47% for foundation models, with 57% and 38% lower search overheads, respectively.
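
The objective stated in the abstract (minimum cost subject to an acceptable latency) can be illustrated with a minimal sketch. This is not the paper's search method, which combines Bayesian Optimization and Deep Reinforcement Learning; it is a brute-force selection over an already-measured candidate set. The names `Deployment`, `cost_per_request`, and `cheapest_feasible` are hypothetical, and the cost model (hourly price multiplied by per-request latency, one request at a time) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deployment:
    """One candidate: a cloud configuration plus a device placement plan."""
    vm_type: str           # hypothetical instance name, not a real Azure SKU
    placement: str         # label for a device placement plan
    price_per_hour: float  # USD per hour for the VM instance
    latency_ms: float      # measured latency of one inference request


def cost_per_request(d: Deployment) -> float:
    # Assumption: requests are served one at a time, so the cost of a
    # request is the hourly price prorated over its latency.
    return d.price_per_hour * d.latency_ms / 3_600_000.0


def cheapest_feasible(candidates: Iterable[Deployment],
                      latency_slo_ms: float) -> Optional[Deployment]:
    """Return the lowest-cost candidate meeting the latency constraint,
    or None if no candidate is feasible."""
    feasible = [d for d in candidates if d.latency_ms <= latency_slo_ms]
    return min(feasible, key=cost_per_request, default=None)
```

Exhaustive enumeration like this is only practical for a handful of candidates. The joint space of cloud configurations and device placement plans is far too large to measure in full, which is the motivation the abstract gives for a guided search.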

Automating Cloud Deployment for Real-Time Online Foundation Model Inference
