STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training

Xiaoyang Sun, Wei Wang, Shenghao Qiu, Renyu Yang, Songfang Huang, Jie Xu, Zheng Wang
DOI: https://doi.org/10.1109/sc41404.2022.00076
2022-01-01
Abstract: Deep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN is out of the reach of many data scientists because it requires high-performance GPU servers that are too expensive to purchase and maintain. We present STRONGHOLD, a novel approach for enabling large DNN model training with no change to the user code. STRONGHOLD scales up the largest trainable model size by dynamically offloading data to the CPU RAM and enabling the use of secondary storage. It automatically determines the minimum amount of data to be kept in the GPU memory to minimize GPU memory usage. Compared to state-of-the-art offloading-based solutions, STRONGHOLD improves the trainable model size by 1.9x~6.5x on a 32GB V100 GPU, with 1.2x~3.7x improvement on the training throughput. It has been deployed into production to successfully support large-scale DNN training.
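The offloading idea in the abstract can be illustrated with a toy simulation. This is a minimal sketch and not STRONGHOLD's actual algorithm: it assumes a model made of equally sized layers and a fixed-size window of layers resident on the GPU, with every other layer parked in CPU RAM. The function name `simulate_forward`, the `window` parameter, and the layer sizes are hypothetical and chosen only for illustration.

```python
from collections import deque


def simulate_forward(layer_sizes_gb, window):
    """Toy forward pass that keeps at most `window` layers on the GPU.

    Returns (peak_gpu_gb, transfers): the peak simulated GPU memory used by
    layer data, and the number of CPU-to-GPU layer fetches.
    Illustrative only; real systems overlap transfers with computation.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    resident = deque()  # indices of layers currently on the simulated GPU
    resident_gb = 0.0
    peak_gb = 0.0
    transfers = 0
    for idx, size in enumerate(layer_sizes_gb):
        # Window full: offload the oldest resident layer back to CPU RAM.
        if len(resident) == window:
            evicted = resident.popleft()
            resident_gb -= layer_sizes_gb[evicted]
        # Fetch the layer that is about to run.
        resident.append(idx)
        resident_gb += size
        transfers += 1
        peak_gb = max(peak_gb, resident_gb)
    return peak_gb, transfers


# Hypothetical model: 48 layers of 0.5 GB each.
layers = [0.5] * 48
peak_all, fetches_all = simulate_forward(layers, window=len(layers))
peak_win, fetches_win = simulate_forward(layers, window=4)
print(f"all layers resident: peak {peak_all} GB")
print(f"window of 4 layers:  peak {peak_win} GB")
```

In this toy setting, shrinking the window lowers peak GPU memory in proportion to the window size, while the number of layer fetches stays the same. The sketch does not model the cost of those transfers, which is the part an offloading system has to hide to preserve throughput.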