High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms

Rong Gu, Zhihao Xu, Yang Che, Xu Wang, Haipeng Dai, Kai Zhang, Bin Fan, Haojun Hou, Li Yi, Yu Ding, Yihua Huang, Guihai Chen
DOI: https://doi.org/10.1109/tpds.2023.3314659
IF: 5.3
2023-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Nowadays, it is prevalent to train deep learning (DL) models on cloud-native platforms that actively leverage containerization and orchestration technologies for high elasticity, low and flexible operation cost, and many other benefits. However, this approach also faces new challenges, and our work focuses on those related to I/O throughput for training, including complex data access, a failure to match dynamic I/O requirements, and inefficient I/O resource scheduling across different jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a high-level data abstraction called Fluid Dataset to access training data from heterogeneous sources with elastic data acceleration. In addition, it comes with an on-the-fly cache system autoscaler that can match the online training speed and increase the number of cache replicas adaptively to alleviate I/O bottlenecks. To improve the overall performance of multiple DL jobs, Fluid co-orchestrates the data cache and DL jobs by arranging job scheduling in an appropriate order, and it can also schedule the data cache and DL jobs on the same node to realize cache affinity. Experimental results show a significant performance improvement for each individual DL job that uses dynamic computing resources with Fluid. For scheduling multiple DL jobs with the same datasets, Fluid achieves around 2x performance speedup when integrated with existing widely used and cutting-edge scheduling solutions through the appropriate job scheduling order. In addition, the cache affinity scheduling policy also improves job execution performance significantly. Fluid is now an open source project hosted by the Cloud Native Computing Foundation (CNCF) with many production adopters.