CatBoost: gradient boosting with categorical features support

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin
DOI: https://doi.org/10.48550/arXiv.1810.11363
2018-10-24
Abstract: In this paper we present CatBoost, a new open-sourced gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of learning algorithm and a CPU implementation of scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes.
Machine Learning, Mathematical Software
What problem does this paper attempt to address?
The main problems this paper addresses are improving the quality of gradient boosting on data with categorical features and reducing overfitting. The paper introduces CatBoost, a new open-source gradient boosting library that handles categorical features effectively and outperforms existing gradient boosting implementations on multiple public datasets.

### Main Contributions:

1. **Effectively Handling Categorical Features**: Most existing gradient boosting implementations require categorical features to be converted into numerical features before training, which may lead to information loss or overfitting. CatBoost avoids these problems by handling categorical features directly during training.
2. **Reducing Overfitting**: CatBoost introduces a new scheme for calculating leaf values when selecting the tree structure, which helps to reduce overfitting. In addition, it uses a technique called "unbiased gradient estimation": the model used to estimate the gradient for a sample is never updated with that sample's own gradient estimate.
3. **High-Performance Implementation**: CatBoost provides a GPU implementation of training and a CPU implementation of scoring. GPU training is significantly faster than in other popular gradient boosting libraries (such as XGBoost and LightGBM), especially on large datasets. CPU scoring is also faster than in other libraries.

### Specific Technical Details:

- **Categorical Feature Handling**:
  - **Statistical Calculation**: CatBoost replaces each categorical value with a statistic computed from the labels: the average label value of the samples that share that category. To avoid overfitting, it uses a random permutation trick: when computing the feature value for a given sample, only the samples that precede it in the permutation are used.
  - **Feature Combination**: CatBoost also considers combinations of categorical features to generate stronger features.
- **Unbiased Gradient Estimation**:
  - To reduce overfitting, CatBoost uses unbiased gradient estimation when selecting the tree structure. Specifically, for each sample \(X_k\), a separate model \(M_k\) is trained, and this model is never updated using the gradient estimate of \(X_k\).
- **High-Performance Implementation**:
  - **GPU Acceleration**: The GPU implementation of CatBoost significantly improves training speed by optimizing memory usage and parallel computation.
  - **Fast Scoring**: CatBoost uses oblivious trees as base predictors and achieves efficient scoring through binary feature encoding and parallel computation.

### Experimental Results:

The paper reports experiments on multiple public datasets, and the results show that CatBoost achieves better quality on classification tasks than existing implementations such as XGBoost, LightGBM, and H2O. CatBoost also shows a significant advantage in training speed on the GPU.

In conclusion, the paper addresses the challenges gradient boosting faces in handling categorical features and reducing overfitting by introducing new techniques and optimizations, and it provides a high-performance implementation.
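
The permutation-based statistic described under Statistical Calculation can be illustrated with a minimal Python sketch. It is not CatBoost's implementation. The function name `ordered_target_statistics`, its parameters, and the smoothing terms `prior` and `prior_weight` are assumptions introduced here; the prior is included only so that the first occurrence of a category, which has no preceding samples, still receives a defined value.

```python
import random


def ordered_target_statistics(categories, labels, prior=0.5, prior_weight=1.0,
                              permutation=None, seed=0):
    """Encode one categorical column with a permutation-ordered label average.

    For each sample, the statistic uses only the labels of samples that share
    its category AND come earlier in the permutation, so a sample's own label
    never contributes to its own encoded value.
    """
    n = len(categories)
    if permutation is None:
        # Hypothetical default: a seeded random order over the sample indices.
        permutation = list(range(n))
        random.Random(seed).shuffle(permutation)

    label_sums = {}   # running sum of labels seen so far, per category
    label_counts = {}  # running number of samples seen so far, per category
    encoded = [0.0] * n

    for idx in permutation:
        cat = categories[idx]
        seen_sum = label_sums.get(cat, 0.0)
        seen_count = label_counts.get(cat, 0)
        # Smoothed average over preceding samples only (prior is an assumption).
        encoded[idx] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now does this sample become visible to later ones.
        label_sums[cat] = seen_sum + labels[idx]
        label_counts[cat] = seen_count + 1

    return encoded
```

Passing an explicit `permutation` makes the result reproducible for inspection; with the default, the order depends on `seed`. Because each value depends on the order, two samples with the same category generally receive different encoded values.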
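
The Fast Scoring point can also be sketched in a few lines. An oblivious tree applies the same split at every node of a given depth, so evaluating it reduces to computing one bit per level and using the resulting integer as an index into a flat array of leaf values. The sketch below assumes that idea only; the function names, the `>` comparison, and the bit order are illustrative choices, not CatBoost's actual data layout.

```python
def oblivious_tree_predict(features, splits, leaf_values):
    """Score one sample with one oblivious tree.

    splits: list of (feature_index, threshold) pairs, one per tree level.
    leaf_values: flat list of length 2 ** len(splits).
    """
    leaf_index = 0
    for depth, (feature_index, threshold) in enumerate(splits):
        # One binary test per level; every node at this depth shares it.
        bit = 1 if features[feature_index] > threshold else 0
        leaf_index |= bit << depth
    return leaf_values[leaf_index]


def ensemble_predict(features, trees):
    """Sum the outputs of all trees; each tree is a (splits, leaf_values) pair."""
    return sum(oblivious_tree_predict(features, splits, values)
               for splits, values in trees)
```

A tree of depth 2 with `splits = [(0, 0.5), (1, 2.0)]` needs four leaf values. No branching on tree structure is required, which is what makes this form convenient for vectorized or parallel scoring.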