G-SLIDE: A GPU-Based Sub-Linear Deep Learning Engine via LSH Sparsification
Zaifeng Pan, Feng Zhang, Hourun Li, Chenyang Zhang, Xiaoyong Du, Dong Deng
DOI: https://doi.org/10.1109/tpds.2021.3132493
IF: 5.3
2021-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Deep learning has been one of the trendiest research topics. However, as data volumes grow exponentially, training large neural networks with billions of parameters can become prohibitively expensive. Fortunately, recent research has shown that not all of the computations in traditional network training are necessary: by selectively sparsifying the majority of the neurons during training, we can still obtain acceptable accuracy. SLIDE, a C++ OpenMP-based sub-linear deep learning engine, was developed to exploit this observation. SLIDE uses locality sensitive hashing (LSH) to query neurons with high activation in sub-linear time. It achieves a remarkable speedup in training large fully-connected networks by exploiting network sparsity as well as multi-core parallelism. However, SLIDE is limited to CPUs, ignoring the popular GPU devices with greater parallel potential and computational capability. In this article, we propose G-SLIDE, a GPU-based sub-linear deep learning engine, which combines the benefits of SLIDE’s adaptive sparsification algorithms with GPUs’ high performance. The main challenges in developing G-SLIDE are using LSH efficiently to sparsify networks and training the resulting sparse neural networks on the GPU. To address these challenges, we propose several novel solutions, such as specific data formats and appropriate workload partitioning for threads to fully utilize the GPU resources. We evaluate G-SLIDE on two extremely sparse datasets with a 2080 Ti GPU, and the results demonstrate that, for the time of one training epoch, G-SLIDE achieves more than 16.4× speedup over SLIDE on a 32-core/64-thread CPU. Furthermore, on the same platform, G-SLIDE achieves an average speedup of 16.2× over TensorFlow-GPU and 30.8× over TensorFlow-CPU.
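The LSH-based neuron selection mentioned in the abstract can be illustrated with a minimal CPU-side sketch. It assumes a SimHash (signed random projection) hash family, which the abstract does not specify, and every function name and parameter below is hypothetical. It is not the G-SLIDE or SLIDE implementation and does not reflect their GPU data formats or thread partitioning.

```python
import random

def make_planes(num_tables, num_bits, dim, rng):
    """Draw random hyperplanes: num_tables tables, num_bits planes per table."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_bits)]
            for _ in range(num_tables)]

def simhash(planes, vec):
    """Pack the signs of the projections of vec onto each plane into an int."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def build_tables(weights, all_planes):
    """Insert every neuron's weight vector into each hash table."""
    tables = []
    for planes in all_planes:
        table = {}
        for neuron_id, w in enumerate(weights):
            table.setdefault(simhash(planes, w), []).append(neuron_id)
        tables.append(table)
    return tables

def query_active(tables, all_planes, x):
    """Return the neurons that share a bucket with input x in any table."""
    active = set()
    for table, planes in zip(tables, all_planes):
        active.update(table.get(simhash(planes, x), []))
    return sorted(active)

def sparse_forward(weights, x, active):
    """Compute pre-activations only for the selected (active) neurons."""
    return {j: sum(w * v for w, v in zip(weights[j], x)) for j in active}

if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_neurons = 16, 1000
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(num_neurons)]
    planes = make_planes(num_tables=8, num_bits=6, dim=dim, rng=rng)
    tables = build_tables(weights, planes)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    active = query_active(tables, planes, x)
    print(len(active), "of", num_neurons, "neurons selected")
```

Neurons whose weight vectors point in a direction similar to the input are more likely to collide with it in at least one table, so the forward pass touches only those candidates instead of scanning the whole layer. A real engine would also rebuild or update the tables as the weights change during training, which this sketch omits.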
computer science, theory & methods; engineering, electrical & electronic