Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia
2024-03-21
Abstract: Deep neural networks (DNNs) are becoming progressively larger and more costly to train. This paper aims to reduce DNN training costs by leveraging preemptible instances on modern clouds, which can be allocated at a much lower price when idle but may be preempted by the cloud provider at any time. Prior work that supports DNN training on preemptible instances takes a reactive approach, handling instance preemptions and allocations only after they occur, which achieves only limited performance and scalability.
Distributed, Parallel, and Cluster Computing