CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs
Xupeng Miao,Lingxiao Ma,Zhi Yang,Yingxia Shao,Bin Cui,Lele Yu,Jiawei Jiang
DOI: https://doi.org/10.1109/tkde.2020.3038109
IF: 9.235
2020-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract:Wide models such as generalized linear models and factorization-based models have been extensively used in various predictive applications, e.g., recommendation, CTR prediction, and image recognition. Due to the memory bounded property of the models, the performance improvement on CPU is reaching the limitation. GPU is known to have many computation units and high memory bandwidth, and becomes a promising platform for training machine learning models. However, the GPU training for the wide models is far from optimal due to the sparsity and irregularity in wide models. The existing GPU-based wide models are even slower than the ones using CPU. The classical training schema of the wide models does not optimized for the GPU architecture, which suffers from large amount of random memory accesses and redundant read/write of intermediate values. In this paper, we propose an efficient GPU-training framework for the large-scale wide models, named cuWide. To fully benefit from the memory hierarchy of GPU, cuWide applies a new flow-based schema for training, which leverages the spatial and temporal locality of wide models to drastically reduce the amount of communication with GPU global memory. To do so, we adopt a bigraph computation model to efficiently realize the flow-based schema and exploit three flexible interfaces for programming. Further, we use the 2D partition of mini-batch (in sample and feature dimensions) with proposed graph abstraction to optimize GPU memory access for sparse data, and apply several spatial-temporal caching mechanisms (importance-based model caching and cross-stage accumulation caching mechanisms) to achieve a high performance kernel. To efficiently implement cuWide, we also propose several GPU-oriented optimizations, including feature-oriented data layout to enhance the data locality, replication mechanism to reduce update conflicts in shared memory, and multi-stream scheduling to overlap data transferring and kernel computing. We show that c-Wide can be up to more than 20× faster than the state-of-the-art GPU solutions and multi-core CPU solutions.
computer science, information systems, artificial intelligence,engineering, electrical & electronic