BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism
Runzhe Chen, Guandong Lu, Yakai Wang, Rui Zhang, Zheng Hu, Yanming Miao, Zhifang Cai, Jingwen Leng, Minyi Guo
DOI: https://doi.org/10.1007/s11704-023-3401-5
IF: 2.6688
2024-11-13
Frontiers of Computer Science
Abstract: As deep neural networks (DNNs) have been successfully adopted in various domains, training these large-scale models has become increasingly difficult and is often deployed on compute clusters composed of many devices such as GPUs. However, as the cluster grows, so does the probability of failures during training. Currently, faults are mainly handled by recording checkpoints and recovering from them, but this approach incurs large overhead and reduces training efficiency even when no error occurs. A low checkpointing frequency leads to a large loss of training time on recovery, while a high checkpointing frequency degrades training efficiency. To resolve this trade-off, we propose BAFT, a bubble-aware fault-tolerant framework for hybrid-parallel distributed training. BAFT automatically analyzes parallel strategies, profiles runtime information, and schedules checkpointing tasks at the granularity of pipeline stages according to the bubble distribution in training. It achieves higher checkpointing efficiency and introduces less than 1% time overhead, which allows checkpoints to be recorded at high frequency, thereby reducing the time lost in error recovery and avoiding the impact of fault tolerance on training.
computer science, information systems, theory & methods, software engineering
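
The checkpoint-frequency trade-off described in the abstract can be illustrated with a standard first-order cost model (Young's approximation). This is a minimal sketch and not BAFT's own scheduler: the function names and the parameters `interval_s`, `ckpt_cost_s`, and `mtbf_s` are assumptions introduced here for illustration, and the model ignores restart time and failures that occur during checkpointing.

```python
import math


def overhead_fraction(interval_s: float, ckpt_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of training time lost to fault tolerance.

    interval_s  : time between checkpoints (hypothetical parameter)
    ckpt_cost_s : training time stalled per checkpoint (hypothetical parameter)
    mtbf_s      : mean time between failures of the cluster (hypothetical parameter)
    """
    if interval_s <= 0 or mtbf_s <= 0 or ckpt_cost_s < 0:
        raise ValueError("interval_s and mtbf_s must be positive, ckpt_cost_s non-negative")
    # Cost paid even when no failure occurs: one stall per interval.
    checkpoint_overhead = ckpt_cost_s / interval_s
    # Cost paid on failure: on average half an interval of work is redone.
    expected_rework = (interval_s / 2.0) / mtbf_s
    return checkpoint_overhead + expected_rework


def young_interval(ckpt_cost_s: float, mtbf_s: float) -> float:
    """Checkpoint interval that minimizes overhead_fraction (Young's approximation)."""
    if ckpt_cost_s < 0 or mtbf_s <= 0:
        raise ValueError("ckpt_cost_s must be non-negative and mtbf_s positive")
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)
```

Under this model, lowering the per-checkpoint stall (for example by hiding checkpoint work in pipeline bubbles, as the abstract proposes) shrinks both the best interval and the total overhead: calling `young_interval` and then `overhead_fraction` with a smaller `ckpt_cost_s` and the same `mtbf_s` yields a shorter interval and a smaller lost-time fraction.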