Abstract: We describe the multi-GPU gradient boosting algorithm implemented in the XGBoost library (https://github.com/dmlc/xgboost). Our algorithm allows fast, scalable training on multi-GPU systems with all of the features of the XGBoost library. We employ data compression techniques to minimise the usage of scarce GPU memory while still allowing highly efficient implementation. Using our algorithm we show that it is possible to process 115 million training instances in under three minutes on a publicly available cloud computing instance. The algorithm is implemented using end-to-end GPU parallelism, with prediction, gradient calculation, feature quantisation, decision tree construction and evaluation phases all computed on device.

What problem does this paper attempt to address?

This paper addresses the slow training of gradient boosting on large-scale datasets. It introduces a multi-GPU gradient boosting algorithm that supports all of the features of the XGBoost library and trains quickly and scalably on multi-GPU systems. Data compression reduces GPU memory usage while keeping the implementation efficient, which improves both training speed and memory efficiency on large datasets. For example, the paper shows that 115 million training instances can be processed in under three minutes on a publicly available cloud computing instance. The main contributions of the paper are:

- **Feature Quantile Generation**: Transform the input feature space into a quantile representation to accelerate decision-tree construction.
- **Data Compression**: Compress the quantized matrix to reduce GPU memory consumption and support larger datasets.
- **Decision-Tree Construction**: Develop a multi-GPU decision-tree construction algorithm that selects split points using parallel prefix-sum operations.
- **Prediction and Gradient Calculation**: Execute prediction and gradient calculation on the GPU, exploiting its high memory bandwidth and parallelism.

Together, these improvements make the XGBoost library more efficient on large-scale datasets, especially in a multi-GPU environment.
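The contributions listed above can be illustrated with a minimal, CPU-only sketch in pure Python. It is not the XGBoost implementation and does not model multi-GPU execution. All function names (`quantile_cuts`, `quantise`, `pack`, `unpack`, `best_split`) are hypothetical, the loss is assumed to be squared error, cut points come from exact ranks rather than an approximate quantile sketch, and a Python integer stands in for the packed device buffer. The split gain uses the standard regularised gradient boosting formula with an assumed L2 penalty `lam`.

```python
from bisect import bisect_right
from itertools import accumulate


def quantile_cuts(values, n_bins):
    """Pick up to n_bins - 1 increasing cut points at evenly spaced ranks."""
    ordered = sorted(values)
    cuts = []
    for i in range(1, n_bins):
        cut = ordered[(i * len(ordered)) // n_bins]
        if not cuts or cut > cuts[-1]:  # skip duplicate cut points
            cuts.append(cut)
    return cuts


def quantise(values, cuts):
    """Replace each value by its bin index in 0 .. len(cuts)."""
    return [bisect_right(cuts, v) for v in values]


def pack(bins, n_symbols):
    """Bit-pack bin indices using the fewest bits that can hold n_symbols."""
    bits = max(1, (n_symbols - 1).bit_length())
    word = 0
    for i, b in enumerate(bins):
        word |= b << (i * bits)
    return word, bits


def unpack(word, bits, count):
    """Recover `count` bin indices from a packed integer."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]


def best_split(bins, grads, hess, n_bins, lam=1.0):
    """Scan a gradient histogram with prefix sums and return (bin, gain).

    Rows with bin index <= the returned bin go to the left child.
    """
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for b, g, h in zip(bins, grads, hess):
        g_hist[b] += g
        h_hist[b] += h
    # Inclusive prefix sums give the left-child statistics for every split.
    g_left = list(accumulate(g_hist))
    h_left = list(accumulate(h_hist))
    g_total, h_total = g_left[-1], h_left[-1]
    parent = g_total ** 2 / (h_total + lam)
    best_bin, best_gain = None, 0.0
    for b in range(n_bins - 1):
        gl, hl = g_left[b], h_left[b]
        gr, hr = g_total - gl, h_total - hl
        gain = 0.5 * (gl ** 2 / (hl + lam) + gr ** 2 / (hr + lam) - parent)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data: one feature, binary labels, constant initial prediction of 0.5.
feature = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
labels = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
preds = [0.5] * len(labels)

cuts = quantile_cuts(feature, n_bins=4)
bins = quantise(feature, cuts)
packed, bits = pack(bins, n_symbols=len(cuts) + 1)

# Squared-error loss (assumed): gradient = prediction - label, hessian = 1.
grads = [p - y for p, y in zip(preds, labels)]
hess = [1.0] * len(labels)

split_bin, gain = best_split(
    unpack(packed, bits, len(bins)), grads, hess, n_bins=len(cuts) + 1
)
print(bins, bits, split_bin, gain)
```

In this toy run each feature value needs only 2 bits instead of a full floating-point number, which is the kind of saving the compression step targets. On a GPU, the sequential `accumulate` call would be replaced by a parallel scan primitive.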

XGBoost: Scalable GPU Accelerated Learning