What problem does this paper attempt to address?

The problem this paper addresses is how to use GPUs to accelerate the training of gradient boosting models on large-scale datasets when GPU memory is limited. Specifically, the author proposes a GPU-based out-of-core gradient boosting algorithm to work around the fact that GPU memory is smaller than main memory, so that larger datasets can be processed without reducing model accuracy or significantly increasing training time.

### Problem Background

1. **GPU Memory Limitation**
   - GPU memory is usually much smaller than main memory. For example, the AWS p3.2xlarge instance is equipped with 16 GiB of GPU memory and 61 GiB of main memory.
   - When the training dataset is large, GPU memory can run out even though plenty of main memory is still available.
2. **Limitations of Existing Solutions**
   - Existing GPU gradient boosting implementations load all data into GPU memory for computation, which limits the size of the datasets they can process.
   - Libraries such as XGBoost support external memory modes, but these mainly target CPUs and do not fully exploit GPU acceleration.

### Solutions Proposed in the Paper

The author addresses these problems as follows:

1. **Out-of-Core GPU Gradient Boosting Algorithm**
   - An out-of-core GPU gradient boosting algorithm is proposed, so that part of the training data can be stored on disk and loaded into GPU memory on demand.
   - Carefully designed data access patterns and gradient-based sampling reduce GPU memory usage while maintaining high training efficiency.
2. **Incremental Quantile Generation**
   - In the preprocessing stage, feature quantiles are generated incrementally, and the quantized data is compressed into the ELLPACK format to reduce its size.
3. **External ELLPACK Matrix**
   - The training data is divided into multiple ELLPACK pages, which are loaded from disk into GPU memory as needed.
4. **Incremental Tree Construction**
   - When constructing decision trees, ELLPACK pages are processed in a streaming manner to avoid loading all data into GPU memory at once.
5. **Gradient-Based Sampling**
   - Gradient-based sampling techniques such as Minimal Variance Sampling (MVS) further reduce the amount of data that needs to be processed, thereby speeding up training.

### Experimental Results

- **Dataset Size**
  - Combined with gradient-based sampling, the out-of-core mode can process datasets with up to 85 million rows and 500 columns on a 16 GiB GPU, whereas only about 13 million rows can be processed without sampling.
- **Model Accuracy**
  - Experiments on the Higgs dataset show that model performance is essentially the same across sampling rates; it drops slightly only when the sampling rate is very low (such as 0.1).
- **Training Time**
  - Although out-of-core GPU training is slightly slower than the in-memory version, it is still significantly faster than the CPU version, especially when sampling is used.

In conclusion, this paper proposes an effective out-of-core GPU gradient boosting algorithm that significantly expands the scale of datasets a single GPU can handle while maintaining model accuracy and training efficiency.
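The streaming idea behind the incremental tree construction step described above can be illustrated with a small CPU-only sketch. It is not the paper's implementation: the names `iter_pages` and `build_histogram` are hypothetical, the pages are plain Python slices rather than ELLPACK pages read from disk, and only gradient sums are tracked. It shows that a gradient histogram accumulated page by page matches the one built from all rows at once, so only one page needs to be resident at a time.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Rows = Sequence[Sequence[int]]   # per row: one quantized bin index per feature
Page = Tuple[Rows, Sequence[float]]


def iter_pages(rows: Rows, grads: Sequence[float], page_size: int) -> Iterator[Page]:
    """Yield (rows, gradients) chunks.

    Hypothetical stand-in for loading one page at a time from disk.
    """
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size], grads[start:start + page_size]


def build_histogram(pages: Iterable[Page], n_features: int, n_bins: int) -> List[List[float]]:
    """Accumulate per-feature, per-bin gradient sums, one page at a time."""
    hist = [[0.0] * n_bins for _ in range(n_features)]
    for page_rows, page_grads in pages:
        # Only the current page is needed here; earlier pages can be released.
        for row, grad in zip(page_rows, page_grads):
            for feature, bin_index in enumerate(row):
                hist[feature][bin_index] += grad
    return hist


# Toy data: 5 rows, 2 features, 3 bins per feature (assumed already quantized).
rows = [[0, 1], [1, 1], [2, 0], [0, 2], [1, 0]]
grads = [0.5, -1.0, 0.25, 2.0, -0.5]

paged = build_histogram(iter_pages(rows, grads, page_size=2), n_features=2, n_bins=3)
full = build_histogram([(rows, grads)], n_features=2, n_bins=3)
print(paged == full)
```

Because the histogram is a sum over rows, splitting the rows into pages changes only how much data is held at once, not the result. A real GPU implementation would additionally have to overlap disk reads with computation, which this sketch does not attempt.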

Out-of-Core GPU Gradient Boosting
