Abstract: Deep neural network (DNN) foundation models now deliver high prediction accuracy and adapt well to a broad range of tasks, at remarkably large model scales. They increasingly serve as the backend of DNN-driven real-time online services, e.g., Siri and Instagram. Such services require low latency and cost efficiency to maintain quality of service and commercial competitiveness. When deployed in a cloud environment, these services call for an appropriate selection of cloud configurations (i.e., specific types of VM instances), as well as a carefully designed device placement plan that assigns the operations of the model to multiple GPUs via model parallelism for cost efficiency. Currently, deployment relies mainly on service providers’ manual efforts, which are not only onerous but also often far from satisfactory because of the huge joint search space of cloud configurations and device placement plans (for the same service, a poor deployment can cost tens of times more). In this paper, we attempt to efficiently automate cloud deployment for real-time foundation model inference, minimizing cost under the constraint of acceptably low latency. This attempt is enabled by 1) jointly leveraging Bayesian Optimization and Deep Reinforcement Learning to adaptively find the (nearly) optimal cloud configuration and device placement within limited search time, and 2) enhancing the cost efficiency of the deployment through a probing-informed block multiplexing mechanism and the Tensor Algebra SuperOptimizer. We implement a prototype system based on TensorFlow, conduct extensive experiments on Microsoft Azure, and demonstrate the generality and scalability of our solution. Results show that, compared with non-trivial baselines, our solution reduces inference costs by up to 15% for lightweight DNN models and up to 47% for foundation models, with 57% and 38% lower search overheads, respectively.

STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training

DaDianNao: A Machine-Learning Supercomputer

Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance

An Efficient 2D Method for Training Super-Large Deep Learning Models

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers Via Memory-Saving Inter-Operator Parallelism

Distributed Training Large-Scale Deep Architectures

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Data-parallel distributed training of very large models beyond GPU capacity

ZeRO-Offload: Democratizing Billion-Scale Model Training

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity

MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud

Deep Neural Network Hardware Deployment Optimization via Advanced Active Learning

Survey on Large Scale Neural Network Training

Hydra: A System for Large Multi-Model Deep Learning

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Automating Cloud Deployment for Real-Time Online Foundation Model Inference