Abstract: Training a high-accuracy model requires trying hundreds of hyperparameter configurations to find the best one. It is common to launch a group of training jobs (called a cojob) with different configurations at the same time and, at the end of every stage (i.e., a certain number of iterations), stop the worst-performing jobs. Hyperparameter searching in deep learning (DL) is therefore accelerated by minimizing the stage completion time (SCT). To complete stages quickly, each job in a cojob typically uses multiple GPUs for distributed training, and the GPUs exchange data over the network in every iteration to synchronize their models. However, because a GPU cluster hosts many cojobs from different users, the data transfers of DL jobs compete for network bandwidth, which causes network congestion and consequently a large SCT for cojobs. Existing flow schedulers aim to reduce flow, coflow, or job completion time, which does not match the requirement of hyperparameter searching. In this paper, we implement a system, Grouper, to minimize the average SCT of cojobs. Grouper uses an algorithm to permute the stages of cojobs and schedules flows from different stages in the order of the permutation. Extensive testbed experiments and simulations show that Grouper outperforms the advanced network designs Baraat, Sincronia, and per-flow fair sharing.
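To illustrate why the order in which stages are served can matter for average SCT, the following is a minimal sketch under strong simplifying assumptions, not Grouper's actual algorithm. It assumes a single bottleneck link, that each stage is fully described by one data volume, and that an ordered schedule gives the whole link to one stage at a time. The function names, the `stage_sizes` values, and the brute-force search over permutations are all hypothetical and chosen only for illustration.

```python
from itertools import permutations


def average_sct_in_order(stage_sizes, order, bandwidth=1.0):
    """Average SCT when stages use the link one after another, in `order`."""
    elapsed = 0.0
    total = 0.0
    for idx in order:
        elapsed += stage_sizes[idx] / bandwidth  # this stage finishes here
        total += elapsed
    return total / len(order)


def average_sct_fair_share(stage_sizes, bandwidth=1.0):
    """Average SCT when all unfinished stages split the link equally."""
    sizes = sorted(stage_sizes)
    n = len(sizes)
    elapsed = 0.0
    total = 0.0
    served = 0.0  # volume already transferred by every unfinished stage
    for i, size in enumerate(sizes):
        active = n - i  # stages still sharing the link
        elapsed += (size - served) * active / bandwidth
        served = size
        total += elapsed
    return total / n


# Hypothetical data volumes for three stages from different cojobs.
stage_sizes = [4.0, 1.0, 2.0]

# Brute force over all permutations; only feasible for tiny toy inputs.
best_order = min(
    permutations(range(len(stage_sizes))),
    key=lambda order: average_sct_in_order(stage_sizes, order),
)

ordered_avg = average_sct_in_order(stage_sizes, best_order)
fair_avg = average_sct_fair_share(stage_sizes)
```

In this toy model the best permutation serves the smallest stage first and gives an average SCT of about 3.67 time units, whereas equal sharing of the link gives 5.0. Real clusters have many links and multi-flow stages, so this only conveys the intuition behind ordering stages.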

Grouper: Accelerating Hyperparameter Searching in Deep Learning Clusters with Network Scheduling