Parallel Training GBRT Based on KMeans Histogram Approximation for Big Data.

Rong Gu, Lei Jin, Yongwei Wu, Jingying Qu, Tao Wang, Xiaojun Wang, Chunfeng Yuan, Yihua Huang
DOI: https://doi.org/10.1007/978-3-319-27122-4_4
2015-01-01
Abstract: Gradient Boosting Regression Tree (GBRT), one of the state-of-the-art ranking algorithms widely used in industry, faces challenges in the big data era. As datasets grow rapidly, the iterative training process of GBRT becomes very time-consuming on large-scale data. In this paper, we aim to speed up the training of each tree in the GBRT framework. First, we propose a novel KMeans histogram building algorithm that has lower time complexity and is more efficient than the cutting-edge histogram building method. Further, we put forward an approximation algorithm that combines kernel density estimation with the histogram technique to improve accuracy. We conduct a variety of experiments on both public Learning To Rank (LTR) benchmark datasets and large-scale real-world datasets from the Baidu search engine. The experimental results show that our proposed parallel training algorithm outperforms the state-of-the-art parallel GBRT algorithm, with a nearly 2 times speedup and better accuracy. Our algorithm also achieves near-linear scalability.
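The abstract does not spell out the histogram construction, so the following is only a generic illustration of the underlying idea, not the paper's algorithm: cluster one feature's values with one-dimensional k-means, use the centroids as histogram bin centers, and accumulate per-bin counts and gradient sums that a tree learner could scan for split candidates. The function names (`kmeans_1d`, `build_histogram`), the quantile-based initialisation, and the bin layout are all assumptions made for this sketch.

```python
from bisect import bisect_right


def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns centroids in ascending order."""
    xs = sorted(values)
    n = len(xs)
    # Assumption: start from evenly spaced quantiles of the sorted values.
    centroids = [xs[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iters):
        # In one dimension, cluster boundaries are the midpoints
        # between adjacent centroids.
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * k
        counts = [0] * k
        for x in xs:
            j = bisect_right(bounds, x)
            sums[j] += x
            counts[j] += 1
        # An empty cluster keeps its previous centroid.
        updated = [
            sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def build_histogram(feature_values, gradients, k):
    """Bin one feature by k-means centroids; accumulate counts and gradient sums."""
    centroids = kmeans_1d(feature_values, k)
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    hist = [{"center": c, "count": 0, "grad_sum": 0.0} for c in centroids]
    for x, g in zip(feature_values, gradients):
        j = bisect_right(bounds, x)
        hist[j]["count"] += 1
        hist[j]["grad_sum"] += g
    return hist


if __name__ == "__main__":
    # Toy data: two well-separated groups of feature values.
    demo = build_histogram([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], [1, 1, 1, 2, 2, 2], 2)
    for bucket in demo:
        print(bucket)
```

In a parallel setting, each worker would typically build such a histogram on its own data partition and the per-bin statistics would then be merged; how the paper performs that merge, and how it applies kernel density estimation, is not described in the abstract and is not modelled here.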