Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

What problem does this paper attempt to address?

The paper addresses the efficiency and scalability of tree boosting in large-scale machine learning tasks. It introduces XGBoost, a scalable end-to-end tree boosting system, which contributes the following:

1. **Handling Sparse Data**: A sparsity-aware algorithm handles missing values and sparse features by learning a default direction at each tree node, so split finding only needs to visit the entries that are actually present.
2. **Approximate Tree Learning**: A weighted quantile sketch proposes candidate split points for approximate tree learning, which matters most when the dataset is too large for exact, exhaustive split enumeration.
3. **Cache Access Pattern Optimization**: A cache-aware block structure reduces memory access latency and speeds up training.
4. **Data Compression and Sharding**: Data compression and sharding allow efficient storage and processing of large-scale data, including datasets that do not fit in memory.
5. **Parallel and Distributed Computing**: Parallel and distributed training accelerates model fitting, enabling XGBoost to handle billions of examples in single-machine or multi-machine environments.

These contributions have helped XGBoost achieve strong results in many machine learning challenges. In Kaggle competitions in particular, XGBoost is widely used and has been a main tool of winning teams in multiple competitions. Through theoretical analysis and experimental evaluation, the paper demonstrates the performance and scalability of XGBoost on large-scale datasets.
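The sparse-data handling summarized above can be illustrated with a minimal sketch of split finding with a learned default direction for a single feature. This is not XGBoost's implementation: the function name `best_split_with_default`, the use of `None` to mark missing values, and the parameter `lam` (standing in for the L2 regularization term) are assumptions made for illustration. The gain formula is the usual second-order one, with the constant factor 1/2 and the complexity penalty left out for brevity.

```python
def best_split_with_default(values, grads, hess, lam=1.0):
    """Return (gain, threshold, default_direction) for one feature.

    values[i] is None when the feature is missing for example i.
    grads/hess are per-example first- and second-order gradient statistics.
    Hypothetical helper for illustration only.
    """
    G, H = sum(grads), sum(hess)
    # Only the non-missing entries are sorted and scanned.
    present = sorted(
        (v, g, h) for v, g, h in zip(values, grads, hess) if v is not None
    )

    def score(g, h):
        # Structure score of a leaf with gradient sum g and hessian sum h.
        return g * g / (h + lam)

    best = (0.0, None, None)

    # Pass 1: missing values default to the RIGHT child.
    # Accumulate left-child statistics over present entries, ascending.
    gl = hl = 0.0
    for i in range(len(present) - 1):
        v, g, h = present[i]
        gl += g
        hl += h
        nxt = present[i + 1][0]
        if nxt == v:
            continue  # cannot split between equal feature values
        gain = score(gl, hl) + score(G - gl, H - hl) - score(G, H)
        if gain > best[0]:
            best = (gain, (v + nxt) / 2, "right")

    # Pass 2: missing values default to the LEFT child.
    # Accumulate right-child statistics over present entries, descending.
    gr = hr = 0.0
    for i in range(len(present) - 1, 0, -1):
        v, g, h = present[i]
        gr += g
        hr += h
        prev = present[i - 1][0]
        if prev == v:
            continue
        gain = score(G - gr, H - hr) + score(gr, hr) - score(G, H)
        if gain > best[0]:
            best = (gain, (prev + v) / 2, "left")

    return best


if __name__ == "__main__":
    # Two examples with the feature present, two with it missing.
    print(best_split_with_default([1.0, 2.0, None, None], [-1, 1, 1, 1], [1, 1, 1, 1]))
```

Because the statistics of the missing entries are recovered by subtraction from the totals `G` and `H`, the work in both passes is proportional to the number of present entries rather than to the number of examples, which is the point of the sparsity-aware design.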

XGBoost: A Scalable Tree Boosting System

TencentBoost: A Gradient Boosting Tree System with Parameter Server

A Hybrid-Domain Framework for Secure Gradient Tree Boosting

XGBoost: Scalable GPU Accelerated Learning

DimBoost

SecureBoost+: Large Scale and High-Performance Vertical Federated Gradient Boosting Decision Tree

Scaling Up Diffusion and Flow-based XGBoost Models

Poster: gbdt-rs: Fast and Trustworthy Gradient Boosting Decision Tree

A Fast Sampling Gradient Tree Boosting Framework

TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting

agtboost: Adaptive and Automatic Gradient Tree Boosting Computations

Tree-Structured Boosting: Connections Between Gradient Boosted Stumps and Full Decision Trees

Secure Collaborative Training and Inference for XGBoost

Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling

Benchmarking and Optimization of Gradient Boosting Decision Tree Algorithms

Compact Multi-Class Boosted Trees

Extreme Gradient Boosting with Squared Logistic Loss Function

DP-XGBoost: Private Machine Learning at Scale

XBNet: An Extremely Boosted Neural Network

A Simple and Fast Baseline for Tuning Large XGBoost Models

TransBoost: A Boosting-Tree Kernel Transfer Learning Algorithm for Improving Financial Inclusion