TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism

Zhenkun Cai, Xiao Yan, Kaihao Ma, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, Fan Yu
DOI: https://doi.org/10.1109/tpds.2021.3132413
IF: 5.3
2022-08-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Effective parallelization strategies are crucial for the performance of distributed deep neural network (DNN) training. Several methods have recently been proposed to search for parallelization strategies, but they all optimize a single objective (e.g., execution time or memory consumption) and produce only one strategy. We propose Frontier Tracking (FT), an efficient algorithm that finds a set of Pareto-optimal parallelization strategies to explore the best trade-off among different objectives. FT can minimize memory consumption when the number of devices is limited and fully utilize additional resources to reduce execution time. Based on FT, we develop a user-friendly system, called TensorOpt, which allows users to run their distributed DNN training jobs without worrying about the details of searching for and coding parallelization strategies. Experimental results show that TensorOpt is more flexible in adapting to resource availability than existing frameworks.
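To make the notion of a Pareto-optimal set of strategies concrete, the following is a minimal sketch in Python. It is not the Frontier Tracking algorithm and does not use the TensorOpt API; it only filters a small list of candidate strategies down to those that are not dominated in both execution time and memory, then picks the fastest one that fits a memory budget. The strategy names, the numbers, and the function names `pareto_frontier` and `fastest_within_budget` are all hypothetical and chosen for illustration.

```python
from typing import List, Optional, Tuple

# (name, execution_time, memory); all values below are made-up illustrations.
Strategy = Tuple[str, float, float]


def pareto_frontier(strategies: List[Strategy]) -> List[Strategy]:
    """Keep strategies that no other strategy beats on both time and memory."""
    # Sort by time, then memory, so any dominating strategy appears earlier.
    ordered = sorted(strategies, key=lambda s: (s[1], s[2]))
    frontier: List[Strategy] = []
    best_memory = float("inf")
    for s in ordered:
        # A slower strategy is only worth keeping if it uses strictly less memory.
        if s[2] < best_memory:
            frontier.append(s)
            best_memory = s[2]
    return frontier


def fastest_within_budget(
    frontier: List[Strategy], memory_budget: float
) -> Optional[Strategy]:
    """Return the fastest frontier strategy that fits the budget, or None."""
    feasible = [s for s in frontier if s[2] <= memory_budget]
    return min(feasible, key=lambda s: s[1]) if feasible else None


# Hypothetical candidates: (name, time in seconds per step, memory in GB per device).
candidates: List[Strategy] = [
    ("data_parallel", 1.0, 16.0),
    ("hybrid", 1.5, 9.0),
    ("dominated", 2.0, 12.0),  # slower than "hybrid" and uses more memory
    ("model_parallel", 2.5, 6.0),
]

frontier = pareto_frontier(candidates)
choice = fastest_within_budget(frontier, memory_budget=10.0)
```

With a set like `frontier` in hand, a tight memory budget selects a low-memory strategy, while a looser budget allows a faster one, which is the kind of trade-off the abstract describes.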
Computer Science, Theory & Methods; Engineering, Electrical & Electronic