Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation

Divyat Mahajan, Ioannis Mitliagkas, Brady Neal, Vasilis Syrgkanis
2024-04-29
Abstract: We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike machine learning, there is no perfect analogue of cross-validation for model selection as we do not observe the counterfactual potential outcomes. Towards this, a variety of surrogate metrics have been proposed for CATE model selection that use only observed data. However, we do not have a good understanding regarding their effectiveness due to limited comparisons in prior studies. We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling.
Machine Learning, Artificial Intelligence, Methodology
What problem does this paper attempt to address?
The paper addresses model selection in causal inference, specifically for Conditional Average Treatment Effect (CATE) estimation. Because counterfactual outcomes (the potential outcomes an individual would have had under a treatment they did not receive) are never observed, the true error of a CATE estimate cannot be computed on held-out data, so standard cross-validation does not apply. Researchers have therefore proposed a variety of surrogate metrics that evaluate and select CATE models using only observed data. The effectiveness of these surrogate metrics has not been well established, because previous studies compared them only in limited settings.

### Research Background and Motivation

1. **Personalized Intervention Effects**: Many decision-making tasks require estimating the personalized impact of an intervention on each individual. Assigning interventions based only on the average effect can give sub-optimal results, because it ignores heterogeneity in the data. Identifying which individuals benefit most from an intervention can therefore lead to better policy-making.
2. **Existing Techniques**: A variety of techniques have been developed to estimate Heterogeneous Treatment Effects (HTE), including adaptive neural networks, random forests, double machine learning frameworks, instrumental variables, and meta-learners. How to choose among these estimators remains a challenge.
3. **Surrogate Metrics**: To address this, researchers have proposed surrogate metrics that use only observed data for model selection. Early surrogate metrics mainly evaluated the nuisance models associated with an estimator, or the utility of a decision policy derived from its estimated treatment effects. More recent work instead constructs a proxy for the true treatment effect and measures how far an estimator's predicted effects deviate from it.

### Main Contributions of the Paper

1. **Comprehensive Empirical Analysis**: The authors benchmarked 34 surrogate metrics on 78 datasets, training 415 CATE estimators. The hyperparameters associated with the metrics were tuned via AutoML to ensure a fair comparison.
2. **Two-Level Model Selection Strategy**: A new two-level model selection strategy was proposed. First, the best hyperparameters are selected within each meta-learner class; then a surrogate metric is used to choose among the remaining meta-learners. This strategy significantly improved the performance of multiple surrogate metrics.
3. **Generative Modeling**: Recent generative modeling techniques (such as RealCause) were used to construct realistic benchmark datasets, making the experimental results more representative.
4. **New Surrogate Metrics**: Several new surrogate metrics were introduced, inspired by related areas such as TMLE, policy learning, calibration, and uplift modeling.

### Summary of Results

- **Globally Dominant Metrics**: Doubly Robust and TMLE variants perform best globally, outperforming other metrics.
- **Plug-in Surrogate Metrics**: Plug-in surrogate metrics (such as the T/X Score) are hardly ever outperformed by other metrics across the datasets, highlighting the importance of using AutoML to learn the nuisance models.
- **Advantages of the T-Learner Strategy**: Metrics based on the T-Learner perform better than those based on the S-Learner, further showing that the choice of nuisance models matters for the performance of surrogate metrics.

Together, these results give practical guidance on model selection in causal inference and show which surrogate metrics are effective in practice.
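
As a rough illustration of the plug-in (T Score) and doubly robust surrogate metrics mentioned in the results summary, the sketch below scores a few candidate CATE predictors using only observed data. It is a minimal sketch under stated assumptions, not the authors' implementation: the synthetic data, the linear nuisance models, the constant propensity estimate (reasonable here only because treatment is randomized), and all names such as `fit_linear`, `t_score`, and `dr_score` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation set with a known CATE (assumption: randomized treatment).
n = 4000
X = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
true_cate = 1.0 + 2.0 * X[:, 0]
Y = X[:, 0] + 0.5 * X[:, 1] + T * true_cate + rng.normal(size=n)


def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns a predict function."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef


# Nuisance models (assumption: one linear outcome model per treatment arm,
# and a constant propensity equal to the observed treated fraction).
mu0 = fit_linear(X[T == 0], Y[T == 0])
mu1 = fit_linear(X[T == 1], Y[T == 1])
e_hat = np.full(n, T.mean())


def t_score(cate_pred):
    """Plug-in metric: squared distance to the T-learner effect mu1(x) - mu0(x)."""
    return np.mean((cate_pred - (mu1(X) - mu0(X))) ** 2)


def dr_score(cate_pred):
    """Doubly robust metric: squared distance to the AIPW pseudo-outcome."""
    m0, m1 = mu0(X), mu1(X)
    pseudo = m1 - m0 + T * (Y - m1) / e_hat - (1 - T) * (Y - m0) / (1 - e_hat)
    return np.mean((cate_pred - pseudo) ** 2)


# Hypothetical candidate CATE predictions to choose among.
candidates = {
    "linear": 1.0 + 2.0 * X[:, 0],
    "constant": np.full(n, 1.0),
    "wrong_sign": -1.0 - 2.0 * X[:, 0],
}

t_scores = {name: t_score(pred) for name, pred in candidates.items()}
dr_scores = {name: dr_score(pred) for name, pred in candidates.items()}
# Oracle error needs the true CATE, which is available only in simulation.
oracle = {name: np.mean((pred - true_cate) ** 2) for name, pred in candidates.items()}

best_by_t = min(t_scores, key=t_scores.get)
best_by_dr = min(dr_scores, key=dr_scores.get)
best_by_oracle = min(oracle, key=oracle.get)
print(best_by_t, best_by_dr, best_by_oracle)
```

In this synthetic setting both surrogate metrics select the same candidate as the oracle error. The oracle is computable here only because the true CATE is known by construction, which is exactly what is missing in real observational data.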