Distributed deep learning system for cancerous region detection on Sunway TaihuLight

GuoFeng Lv, MingFan Li, Hong An, Han Lin, Junshi Chen, Wenting Han, Qian Xiao, Fei Wang, Rongfen Lin
DOI: https://doi.org/10.1007/s42514-020-00046-5
2020-09-15
CCF Transactions on High Performance Computing
Abstract: To explore the potential of distributed training of deep neural networks, we implement several distributed algorithms on top of swFlow on the world-leading supercomputer Sunway TaihuLight. Starting from two naive designs, parameter server and ring all-reduce, we present the limitations of their communication models and discuss optimizations that adapt them to the five-level interconnect architecture of the Sunway system. To reduce the communication bottleneck on large-scale systems, multi-server and hierarchical ring all-reduce models are introduced. On a benchmark derived from a deep learning-based cancerous region detection algorithm, the average parallel efficiency exceeds 80% on up to 1024 processors. This reveals a great opportunity for combining deep learning with HPC systems.
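
For readers unfamiliar with the communication patterns named in the abstract, the following is a minimal single-process sketch of a textbook ring all-reduce and a generic two-level (hierarchical) variant. It is not the swFlow implementation and does not model the Sunway interconnect; the function names `ring_all_reduce` and `hierarchical_all_reduce`, the `group_size` parameter, and the choice of the first worker in each group as its leader are all assumptions made for illustration.

```python
def ring_all_reduce(worker_grads):
    """Simulate a sum ring all-reduce over equal-length gradient vectors."""
    p = len(worker_grads)
    n = len(worker_grads[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(c * n) // p for c in range(p + 1)]
    data = [list(g) for g in worker_grads]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(first_chunk, accumulate):
        # p - 1 steps; in each step every worker sends one chunk to its
        # right-hand neighbour. Payloads are snapshotted first so that all
        # sends in a step behave as if they happened simultaneously.
        for s in range(p - 1):
            sends = []
            for r in range(p):
                c = (first_chunk(r) - s) % p
                sends.append((c, [data[r][i] for i in chunk(c)]))
            for r, (c, payload) in enumerate(sends):
                dst = (r + 1) % p
                for i, v in zip(chunk(c), payload):
                    data[dst][i] = data[dst][i] + v if accumulate else v

    # Phase 1 (reduce-scatter): worker r ends up owning the full sum of
    # chunk (r + 1) mod p.
    exchange(lambda r: r, accumulate=True)
    # Phase 2 (all-gather): the fully reduced chunks circulate around the ring.
    exchange(lambda r: r + 1, accumulate=False)
    return data


def hierarchical_all_reduce(worker_grads, group_size):
    """Two-level sketch (assumed scheme): reduce in groups, then across leaders."""
    groups = [worker_grads[i:i + group_size]
              for i in range(0, len(worker_grads), group_size)]
    # Level 1: ring all-reduce inside each group.
    partial = [ring_all_reduce(g) for g in groups]
    # Level 2: ring all-reduce across one leader per group (hypothetical choice:
    # the first worker of each group).
    leaders = ring_all_reduce([grp[0] for grp in partial])
    # Each leader broadcasts the global sum back to its group members.
    return [list(leaders[gi]) for gi, grp in enumerate(partial) for _ in grp]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [0, 0, 0, 1]]
    print(ring_all_reduce(grads)[0])
    print(hierarchical_all_reduce(grads, group_size=2)[0])
```

In the flat ring, every worker takes part in 2(p - 1) communication steps, so the step count grows with the number of workers p. Grouping workers shortens each ring, which is the general motivation behind hierarchical schemes like the one the abstract mentions.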