Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs

Rong Gu,Kai Zhang,Zhihao Xu,Yang Che,Bin Fan,Haojun Hou,Haipeng Dai,Li Yi,Yu Ding,Guihai Chen,Yihua Huang
DOI: https://doi.org/10.1109/icde53745.2022.00209
2022-01-01
Abstract:Nowdays, it is prevalent to train deep learning (DL) models in cloud-native platforms that actively leverage containerization and orchestration technologies for high elasticity, low and flexible operation cost, and many other benefits. However, it also faces new challenges and our work is focusing on those related to I/O throughput for training, including complex data access with complicated performance tuning, lack of cache capacity with specialized hardware to match its high and dynamic I/O requirement, and inefficient I/O resource sharing across different training jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a data abstraction called Fluid Dataset to access training data from heterogeneous sources in a unified manner with transparent and elastic data acceleration powered by auto-tuned cache runtimes. In addition, it comes with an on-the-fly cache system autoscaler that can intelligently scale up and down the cache capacity to match the online training speed of each individual DL job. To improve the overall performance of multiple DL jobs, Fluid can co-orchestrate the data cache and DL jobs by arranging job scheduling in an appropriate order. Our experimental results show significant performance improvement of each individual DL job which uses dynamic computing resources with Fluid. In addition, for scheduling multiple DL jobs with same datasets, Fluid gives around 2x performance speedup when integrated with existing widely-used and cutting-edge scheduling solutions. Fluid is now an open source project hosted by Cloud Native Computing Foundation (CNCF) with adopters in production including Alibaba Cloud, Tencent Cloud, Weibo.com, China Telecom, etc.
What problem does this paper attempt to address?