XGBoost: A Scalable Tree Boosting System

Tianqi Chen, Carlos Guestrin
DOI: https://doi.org/10.1145/2939672.2939785
2016-06-11
Abstract: Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the efficiency and scalability of tree boosting on large-scale machine learning tasks. It introduces XGBoost, a scalable end-to-end tree boosting system, and describes contributions in the following areas:

1. **Handling Sparse Data**: A new sparsity-aware algorithm handles missing values and sparse features directly, which improves computational efficiency on sparse inputs.
2. **Approximate Tree Learning**: A weighted quantile sketch proposes candidate split points for approximate tree learning, which matters most when the dataset is too large for exact split enumeration.
3. **Cache Access Pattern Optimization**: A cache-aware block structure reduces memory access latency and speeds up training.
4. **Data Compression and Sharding**: Data compression and sharding allow efficient storage and processing of large-scale data, including datasets that do not fit in memory.
5. **Parallel and Distributed Computing**: Parallel and distributed computation accelerates training, so that XGBoost can handle billions of examples in single-machine or multi-machine settings.

Together, these contributions have let XGBoost achieve state-of-the-art results on many machine learning challenges. In Kaggle competitions in particular, XGBoost is widely used, including by winning teams. The paper supports its claims about performance and scalability on large-scale datasets with both analysis and experiments.
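
To make the sparsity-aware idea concrete, the following is a minimal Python sketch, not XGBoost's actual implementation. It assumes the usual second-order boosting setup in which each example carries a gradient and a Hessian, and it scores a split with the regularized gain G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - G^2/(H + lambda). The function name `best_split_with_default`, the use of `None` for missing values, and the parameter `lam` are illustrative assumptions. For a single feature, the sketch scans only the non-missing values and, at each candidate threshold, tries sending all missing values to the right and to the left, keeping whichever default direction yields the higher gain.

```python
def best_split_with_default(values, grads, hess, lam=1.0):
    """Find the best threshold and default direction for one feature.

    values: feature values, with None marking a missing entry (assumption).
    grads, hess: per-example first- and second-order gradient statistics.
    Returns (gain, threshold, default_direction); threshold is None if no
    split improves on leaving the node unsplit.
    """
    def score(g, h):
        # Regularized structure score of a leaf with gradient sum g, Hessian sum h.
        return g * g / (h + lam)

    G, H = sum(grads), sum(hess)

    # Only non-missing entries are sorted and scanned, so the cost scales
    # with the number of present values rather than with all examples.
    present = sorted(
        (v, g, h) for v, g, h in zip(values, grads, hess) if v is not None
    )
    G_present = sum(g for _, g, _ in present)
    H_present = sum(h for _, _, h in present)
    G_miss, H_miss = G - G_present, H - H_present

    best = (0.0, None, None)
    gl = hl = 0.0  # running sums over present values at or below the threshold
    for i in range(len(present) - 1):
        v, g, h = present[i]
        gl += g
        hl += h
        nxt = present[i + 1][0]
        if nxt == v:
            continue  # no threshold can separate equal values
        thr = (v + nxt) / 2

        # Case 1: missing values default to the right child.
        gain_right = score(gl, hl) + score(G - gl, H - hl) - score(G, H)
        # Case 2: missing values default to the left child.
        gain_left = (
            score(gl + G_miss, hl + H_miss)
            + score(G_present - gl, H_present - hl)
            - score(G, H)
        )
        for gain, direction in ((gain_right, "right"), (gain_left, "left")):
            if gain > best[0]:
                best = (gain, thr, direction)
    return best


# Toy data: the two missing entries have the same gradient sign as the
# examples with large feature values.
values = [1.0, 2.0, None, None, 3.0, 4.0]
grads = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 6
gain, threshold, direction = best_split_with_default(values, grads, hess)
print(threshold, direction)  # 2.5 right
```

In this toy example the missing entries behave like the examples with large feature values, so the learned default direction is to the right. The sketch folds both default directions into one ascending scan; a two-pass scan (ascending and descending) computes the same candidates.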