HPDL: Towards a General Framework for High-performance Distributed Deep Learning.

Dongsheng Li, Zhiquan Lai, Keshi Ge, Yiming Zhang, Zhaoning Zhang, Tao Sun, Qinglin Wang, Huaimin Wang
DOI: https://doi.org/10.1109/icdcs.2019.00173
2019-01-01
Abstract: With the growing scale of data volumes and neural network sizes, we have entered the era of distributed deep learning. High-performance training and inference on distributed computing systems have been attracting increasing research attention in both academia and industry. Meanwhile, the diversity of existing machine learning frameworks (e.g., TensorFlow, PyTorch and MXNet) and the explosion of deep learning hardware (e.g., CPUs, GPUs, FPGAs and ASICs) make it more challenging for users to leverage new deep learning technologies and the acceleration capabilities of hardware devices. We first survey the state-of-the-art work in the area, which informs our vision of the future deep learning framework. Then, we propose HPDL, a general framework for high-performance distributed deep learning that is compatible with existing frameworks and adaptive to various hardware architectures. Finally, we discuss and anticipate the key technologies for achieving high-performance and large-scale deep learning, including optimization algorithms, hybrid communication mechanisms, model parallelization, resource scheduling and single-node execution optimization.