AutoDist: a composable and automated synchronization system for distributed deep learning

Hao Zhang, Peng Wu, Zhijie Deng, Christy Li, Qirong Ho, Aurick Qiao, Zeya Wang, Eric P. Xing
2021-01-01
Abstract: Efficient data-parallel distributed training has been a key driver behind recent innovations in deep learning (DL). However, achieving satisfactory distributed performance involves making difficult system-level decisions related to diverse synchronization aspects. We present AutoDist, which automatically composes parallel synchronization strategies for DL models by rewriting their original dataflow graphs into parallel versions. Unlike existing training systems with fixed strategies, AutoDist adaptively composes strategies by jointly optimizing multiple aspects, each applied to different parts of the DL model. Compared to other graph rewriting systems, AutoDist deliberately breaks seemingly distinct synchronization optimizations into atomic graph rewriting kernels, and allows mechanically assembling them to express new strategies that extrapolate to new models and clusters. We show that AutoDist can find high-performance strategies quickly, and enables model training 1.2x to 1.6x faster than hand-optimized baselines. Critically, AutoDist does not require manual tuning when faced with new DL models or cluster configurations.