Optimizing Deep Learning Frameworks Incrementally to Get Linear Speedup: A Comparison Between IPoIB and RDMA Verbs

Chang Liu, Jianwen Wei, Yi-Chao Wang, Minhua Wen, Simon See, James Lin
DOI: https://doi.org/10.1109/padsw.2018.8644531
2018-01-01
Abstract: Deeper models and larger datasets are two major ingredients for applying deep learning (DL) to real-world problems, which inevitably shifts model training from a single GPU card to GPU clusters because of limited GPU memory and time-to-solution requirements. High-speed, low-latency RDMA-capable network fabrics such as InfiniBand and RoCE play an important role in coping with the enormous volume of data exchanged during training. DL frameworks are built on these fabrics through various APIs, including IPoIB, MPI, and RDMA Verbs. Adapting DL frameworks to RDMA-capable networks involves tradeoffs between performance and usability, and improper design choices may result in code that is high-performance yet hard to maintain and hard to merge. This paper presents our approach to adapting MXNet, a modular, versatile DL framework, to RDMA-capable networks. Dividing the training process on MXNet into P2P communication and AllReduce communication, we add incremental optimizations to its message-passing code. Experiments show that our approach exhibits near-linear speedup, with parallel efficiency reaching 96% when scaling to 100 GPU cards, compared with 53% for the original IPoIB version. In contrast to other MPI-based porting approaches, our modifications are confined to MXNet's Parameter Server module and are transparent to upper-layer operations, so features such as auto recovery and user-controlled consistency view are not sacrificed.
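To put the two parallel-efficiency figures in perspective, the short sketch below converts them into implied speedups. It assumes the conventional definition, efficiency = speedup / number of workers; the abstract does not state which definition or baseline the paper uses, and the function name `speedup_from_efficiency` is hypothetical and chosen only for illustration.

```python
def speedup_from_efficiency(efficiency: float, workers: int) -> float:
    """Return the speedup implied by a parallel efficiency.

    Assumes the conventional definition E = S / N, so S = E * N.
    This definition is an assumption; the abstract does not spell it out.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return efficiency * workers


GPUS = 100  # scale reported in the abstract

# Efficiencies reported in the abstract for the two versions.
rdma_speedup = speedup_from_efficiency(0.96, GPUS)
ipoib_speedup = speedup_from_efficiency(0.53, GPUS)

print(f"RDMA Verbs version: about {rdma_speedup:.0f}x on {GPUS} GPUs")
print(f"Original IPoIB version: about {ipoib_speedup:.0f}x on {GPUS} GPUs")
print(f"Ratio between the two: about {rdma_speedup / ipoib_speedup:.2f}x")
```

Under that assumption, the optimized version delivers roughly 96 GPUs' worth of useful work out of 100, against roughly 53 for the IPoIB version.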