ElasticDL: A Kubernetes-native Deep Learning Framework with Fault-tolerance and Elastic Scheduling

Jun Zhou,Ke Zhang,Feng Zhu,Qitao Shi,Wenjing Fang,Lin Wang,Yi Wang
DOI: https://doi.org/10.1145/3539597.3573037
2023-01-01
Abstract:The power of artificial intelligence (AI) models originates with sophisticated model architecture as well as the sheer size of the model. These large-scale AI models impose new and challenging system requirements regarding scalability, reliability, and flexibility. One of the most promising solutions in the industry is to train these large-scale models on distributed deep-learning frameworks. With the power of all distributed computations, it is desired to achieve a training process with excellent scalability, elastic scheduling (flexibility), and fault tolerance (reliability). In this paper, we demonstrate the scalability, flexibility, and reliability of our open-source Elastic Deep Learning (ElasticDL) framework. Our ElasticDL utilizes an open-source system, i.e., Kubernetes, for automating deployment, scaling, and management of containerized application features to provide fault tolerance and support elastic scheduling for DL tasks.
What problem does this paper attempt to address?