UTBoost: Gradient Boosted Decision Trees for Uplift Modeling

Junjie Gao, Xiangyu Zheng, DongDong Wang, Zhixiang Huang, Bangqi Zheng, Kai Yang
2024-12-03
Abstract: Uplift modeling comprises a collection of machine learning techniques designed for managers to predict the incremental impact of specific actions on customer outcomes. However, accurately estimating this incremental impact poses significant challenges due to the necessity of determining the difference between two mutually exclusive outcomes for each individual. In our study, we introduce two novel modifications to the established Gradient Boosting Decision Trees (GBDT) technique. These modifications sequentially learn the causal effect, addressing the counterfactual dilemma. Each modification innovates upon the existing technique in terms of the ensemble learning method and the learning objective, respectively. Experiments with large-scale datasets validate the effectiveness of our methods, consistently achieving substantial improvements over baseline models.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a central challenge in uplift modeling: accurately estimating the incremental impact of a specific action on a customer outcome. The difficulty is that the estimate requires each individual's outcome under two mutually exclusive situations, and only one of them can ever be observed in practice. To address this, the authors propose two modifications to the established Gradient Boosting Decision Tree (GBDT) technique. Both modifications learn the causal effect sequentially, which addresses the counterfactual problem (i.e., the outcome of an individual cannot be observed both with and without treatment).

The main contributions of the paper are:

1. **Proposing a new boosting tree method**: The authors extend the traditional bagging approach to boosting in order to maximize the heterogeneity of causal effects. This method performs particularly well on high-dimensional datasets.
2. **Integrating potential outcomes and causal effects**: For the first time, the joint optimization of potential outcomes and causal effects is introduced into the classical GBDT framework, and a second-order method is used to fit the multi-objective function. This significantly reduces the computational complexity of the algorithm.
3. **Experimental verification**: Extensive experiments on four large-scale real-world datasets and public datasets show that the proposed models outperform the baseline methods and are more robust.

The paper also details how to estimate treatment effects with gradient-boosted decision trees (the two proposed methods, TDDP and CausalGBM) and discusses how these methods perform on different datasets. In particular, CausalGBM shows excellent robustness and accuracy across multiple datasets, while TDDP needs to be combined with regularization to prevent overfitting.
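The counterfactual problem described above can be made concrete with a small simulation. The sketch below is not taken from the paper: the data-generating process, the function names `simulate_randomized_trial` and `difference_in_means`, and every constant are assumptions chosen for illustration. It generates both potential outcomes for each individual, reveals only the one matching a randomly assigned treatment, and then checks that a difference in group means still recovers the average effect.

```python
import numpy as np


def simulate_randomized_trial(n=200_000, seed=0):
    """Synthetic randomized experiment (all constants are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)        # a single feature
    y0 = x + rng.normal(0.0, 0.1, size=n)    # potential outcome without treatment
    tau = 0.5 * (x > 0.5)                    # true individual uplift y(1) - y(0)
    y1 = y0 + tau                            # potential outcome with treatment
    w = rng.integers(0, 2, size=n)           # random treatment assignment (0 or 1)
    y = np.where(w == 1, y1, y0)             # only one potential outcome is observed
    return x, w, y, tau


def difference_in_means(y, w):
    """Mean observed outcome of the treated minus that of the controls."""
    return y[w == 1].mean() - y[w == 0].mean()


x, w, y, tau = simulate_randomized_trial()
print("true average effect:", tau.mean())
print("difference-in-means estimate:", difference_in_means(y, w))

# Conditioning on the feature gives a crude estimate of the conditional effect.
high = x > 0.5
print("estimated effect for x > 0.5:", difference_in_means(y[high], w[high]))
```

In real data `tau` is never available, which is why uplift models are trained and evaluated on group-level contrasts like the one computed here rather than on individual-level errors.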
### Formula Summary

- **Uplift Definition**:
  \[ \tau_i = y_i(1) - y_i(0) \]
  where \(y_i(1)\) and \(y_i(0)\) represent the potential outcomes of individual \(i\) when receiving and not receiving treatment, respectively.
- **Conditional Average Treatment Effect (CATE)**:
  \[ \tau(x) = E[y \mid w = 1, X = x] - E[y \mid w = 0, X = x] \]
- **Loss Function**:
  \[ L(\tau(x), u_m(x)) = \frac{1}{2n}\left\{E[y \mid X = x, w = 1] - E[y \mid X = x, w = 0] - u_m(x)\right\}^2 \]
- **Optimal Splitting Criterion**:
  \[ s^* = \arg\max_s\left\{\frac{n_L n_R}{n}(\bar{\tau}_L - \bar{\tau}_R)^2\right\} \]

These formulas help to understand the key concepts in uplift modeling and the algorithm optimization process.
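The splitting criterion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes that the child-level effects \(\bar{\tau}_L\) and \(\bar{\tau}_R\) are estimated as the difference between treated and control mean outcomes in each child, it scans a single feature exhaustively, and the names `child_uplift` and `best_split` are hypothetical.

```python
import numpy as np


def child_uplift(y, w):
    """Assumed child-level estimate: treated mean minus control mean."""
    return y[w == 1].mean() - y[w == 0].mean()


def best_split(x, y, w):
    """Scan thresholds on one feature and return (threshold, score) maximizing
    n_L * n_R / n * (tau_L - tau_R) ** 2."""
    order = np.argsort(x)
    x, y, w = x[order], y[order], w[order]
    n = len(x)
    best_threshold, best_score = None, -np.inf
    for k in range(1, n):
        if x[k] == x[k - 1]:
            continue  # no threshold separates equal feature values
        left_w, right_w = w[:k], w[k:]
        # Each child needs both treated and control samples to estimate an effect.
        if len(np.unique(left_w)) < 2 or len(np.unique(right_w)) < 2:
            continue
        tau_l = child_uplift(y[:k], left_w)
        tau_r = child_uplift(y[k:], right_w)
        score = k * (n - k) / n * (tau_l - tau_r) ** 2
        if score > best_score:
            # Place the threshold midway between the two neighbouring values.
            best_threshold, best_score = (x[k - 1] + x[k]) / 2, score
    return best_threshold, best_score


# Toy data: treatment raises the outcome by 1 only when x >= 0.5.
x_demo = np.arange(200) / 200
w_demo = np.arange(200) % 2
y_demo = (w_demo * (x_demo >= 0.5)).astype(float)
print(best_split(x_demo, y_demo, w_demo))
```

The factor \(n_L n_R / n\) penalizes very unbalanced splits, so the criterion favours thresholds that separate large groups with clearly different estimated effects. A production implementation would typically use histogram-based or cumulative-sum scans instead of recomputing means at every threshold.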