GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning.

Hanyu Zhao, Zhi Yang, Yu Cheng, Chao Tian, Shiru Ren, Wencong Xiao, Man Yuan, Langshi Chen, Kaibo Liu, Yang Zhang, Yong Li, Wei Lin
DOI: https://doi.org/10.1145/3589773
2023-01-01
Abstract: Training data pre-processing pipelines are essential to deep learning (DL). As the performance of model training keeps increasing with both hardware advancements (e.g., faster GPUs) and various software optimizations, the data pre-processing on CPUs is becoming more resource-intensive and a severe bottleneck of the pipeline. This problem is even worse in the cloud, where training jobs exhibit diverse CPU-GPU demands that usually result in mismatches with fixed hardware configurations and resource fragmentation, degrading both training performance and cluster utilization. We introduce GoldMiner, an input data processing service for stateless operations used in pre-processing data for DL model training. GoldMiner decouples data pre-processing from model training into a new role called the data worker. Data workers facilitate scaling of data pre-processing to anywhere in a cluster, effectively pooling the resources across the cluster to satisfy the diverse requirements of training jobs. GoldMiner achieves this decoupling in a fully automatic and elastic manner. The key insight is that data pre-processing is inherently stateless, thus can be executed independently and elastically. This insight guides GoldMiner to automatically extract stateless computation out of a monolithic training program, efficiently disaggregate it across data workers, and elastically scale data workers to tune the resource allocations across jobs to optimize cluster efficiency. We have applied GoldMiner to industrial workloads, and our evaluation shows that GoldMiner can transform unmodified training programs to use data workers, accelerating individual training jobs by up to 12.1x. GoldMiner also improves average job completion time and aggregate GPU utilization by up to 2.5x and 2.1x in a 64-GPU cluster, respectively, by scheduling data workers with elasticity.
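The statelessness insight in the abstract can be illustrated with a minimal sketch. This is not GoldMiner's implementation or API: the names `preprocess` and `run_data_workers` are hypothetical, and local threads merely stand in for data workers that the paper places on other machines in the cluster. The sketch only shows why a transform whose output depends on nothing but its input can be run on any number of workers without changing what the training loop receives.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def preprocess(sample: int) -> int:
    """Hypothetical stateless transform: the output depends only on the input."""
    return sample * 2 + 1


def run_data_workers(
    samples: Iterable[int],
    transform: Callable[[int], int],
    num_workers: int,
) -> List[int]:
    """Apply `transform` to every sample using `num_workers` parallel workers.

    Threads are an assumption made for illustration; they stand in for
    remote data workers.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map() yields results in input order, so the consumer
        # sees the same stream regardless of the pool size.
        return list(pool.map(transform, samples))


if __name__ == "__main__":
    batch = list(range(8))
    one_worker = run_data_workers(batch, preprocess, num_workers=1)
    four_workers = run_data_workers(batch, preprocess, num_workers=4)
    # Scaling the pool changes throughput, not the pre-processed data.
    print(one_worker == four_workers)
```

Because no state is shared between calls to the transform, the worker count is purely a resource-allocation knob, which is the property the abstract relies on for elastic scaling.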