Out-of-Core GPU Gradient Boosting

Rong Ou
DOI: https://doi.org/10.48550/arXiv.2005.09148
2020-05-19
Abstract: GPU-based algorithms have greatly accelerated many machine learning methods; however, GPU memory is typically smaller than main memory, limiting the size of training data. In this paper, we describe an out-of-core GPU gradient boosting algorithm implemented in the XGBoost library. We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time. To the best of our knowledge, this is the first out-of-core GPU implementation of gradient boosting. Similar approaches can be applied to other machine learning algorithms.
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses how to use a GPU to accelerate gradient boosting training on large datasets when GPU memory is limited. Specifically, the author proposes a GPU-based out-of-core gradient boosting algorithm that works around GPU memory being smaller than main memory, so that larger datasets can be processed without reducing model accuracy or significantly increasing training time.

### Problem Background

1. **GPU Memory Limitation**
   - GPU memory is usually much smaller than main memory. For example, the AWS p3.2xlarge instance has 16 GiB of GPU memory and 61 GiB of main memory.
   - A large training dataset can easily exhaust GPU memory even when plenty of main memory is still available.
2. **Limitations of Existing Solutions**
   - Existing GPU gradient boosting implementations load all data into GPU memory before computing, which limits the size of the datasets they can process.
   - Libraries such as XGBoost support external memory modes, but these mainly target CPUs and do not fully use GPU acceleration.

### Solutions Proposed in the Paper

The author addresses these problems as follows:

1. **Out-of-Core GPU Gradient Boosting Algorithm**
   - An out-of-core GPU gradient boosting algorithm is proposed in which part of the training data is stored on disk and loaded into GPU memory on demand.
   - Carefully designed data access patterns and gradient-based sampling reduce GPU memory usage while keeping training efficient.
2. **Incremental Quantile Generation**
   - In the preprocessing stage, feature quantiles are generated incrementally, and the quantized data is compressed into the ELLPACK format to reduce its size.
3. **External ELLPACK Matrix**
   - The training data is divided into multiple ELLPACK pages, which are loaded from disk into GPU memory as needed.
4. **Incremental Tree Construction**
   - When constructing decision trees, ELLPACK pages are processed in a streaming manner to avoid loading all data into GPU memory at once.
5. **Gradient-Based Sampling**
   - Gradient-based sampling techniques such as Minimum Variance Sampling (MVS) further reduce the amount of data that needs to be processed, which speeds up training.

### Experimental Results

- **Dataset Size**
  - Combined with gradient-based sampling, the out-of-core mode can process datasets with up to 85 million rows and 500 columns on a 16 GiB GPU, compared with only about 13 million rows without sampling.
- **Model Accuracy**
  - Experiments on the Higgs dataset show that model performance is essentially the same across sampling rates; it drops slightly only when the sampling rate is very low (such as 0.1).
- **Training Time**
  - Although out-of-core GPU training is slightly slower than the in-memory version, it is still significantly faster than the CPU version, especially when sampling is used.

In conclusion, this paper proposes an effective out-of-core GPU gradient boosting algorithm that significantly expands the size of the datasets a single GPU can handle while maintaining model accuracy and training efficiency.
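
The streaming idea behind incremental tree construction can be illustrated with a minimal sketch. It is not XGBoost's implementation: the names `make_pages` and `build_histogram` are hypothetical, plain Python lists of bin indices stand in for ELLPACK pages, in-memory slices stand in for pages read from disk, and only gradients are accumulated (not gradient and Hessian pairs). It only shows that a gradient histogram accumulated one page at a time matches one built from the full dataset, which is why the pages never need to be resident together.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# A "page" here is (bin indices per row, gradient per row). Assumption: this
# stands in for an ELLPACK page; the real format is a compressed GPU layout.
Page = Tuple[List[List[int]], List[float]]


def make_pages(bins: List[List[int]], grads: List[float],
               page_size: int) -> Iterator[Page]:
    """Yield consecutive row slices, standing in for pages loaded from disk."""
    for start in range(0, len(bins), page_size):
        yield bins[start:start + page_size], grads[start:start + page_size]


def build_histogram(pages: Iterable[Page], n_features: int,
                    n_bins: int) -> List[List[float]]:
    """Accumulate per-feature, per-bin gradient sums one page at a time."""
    hist = [[0.0] * n_bins for _ in range(n_features)]
    for page_bins, page_grads in pages:
        # Only the current page is touched inside this loop body.
        for row, g in zip(page_bins, page_grads):
            for feature, bin_idx in enumerate(row):
                hist[feature][bin_idx] += g
    return hist


random.seed(0)
n_rows, n_features, n_bins = 1000, 4, 8
bins = [[random.randrange(n_bins) for _ in range(n_features)]
        for _ in range(n_rows)]
grads = [random.uniform(-1.0, 1.0) for _ in range(n_rows)]

# Streaming over pages of 128 rows versus one pass over all rows at once.
paged = build_histogram(make_pages(bins, grads, page_size=128),
                        n_features, n_bins)
full = build_histogram([(bins, grads)], n_features, n_bins)
```

Because histogram construction is a sum over rows, the page size affects only peak memory, not the result; in this sketch the two histograms are identical since rows are visited in the same order.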