Accounting for Missing Covariates in Heterogeneous Treatment Estimation

Khurram Yamin, Vibhhu Sharma, Ed Kennedy, Bryan Wilder
2024-10-21
Abstract: Many applications of causal inference require using treatment effects estimated on a study population to make decisions in a separate target population. We consider the challenging setting where there are covariates that are observed in the target population that were not seen in the original study. Our goal is to estimate the tightest possible bounds on heterogeneous treatment effects conditioned on such newly observed covariates. We introduce a novel partial identification strategy based on ideas from ecological inference; the main idea is that estimates of conditional treatment effects for the full covariate set must marginalize correctly when restricted to only the covariates observed in both populations. Furthermore, we introduce a bias-corrected estimator for these bounds and prove that it enjoys fast convergence rates and statistical guarantees (e.g., asymptotic normality). Experimental results on both real and synthetic data demonstrate that our framework can produce bounds that are much tighter than would otherwise be possible.
Machine Learning, Methodology
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper addresses how to use treatment effects estimated in a study population to make decisions in a separate target population, especially when the target population has covariates that were not observed in the study. Specifically, it focuses on bounding the Conditional Average Treatment Effect (CATE) as tightly as possible when conditioning on these newly observed covariates.

### Background and motivation

In many applications of causal inference, researchers need to use treatment effects estimated in one study population to make decisions in another target population. For example, a health system may wish to deploy a new intervention in its population, using existing study data from which heterogeneous treatment effects have been estimated. However, because institutional settings differ, the health system will almost certainly record characteristics that were not measured in the study, such as social determinants of health (socioeconomic status, education level, etc.). These new covariates may have an important impact on the treatment effect, but no outcome data is associated with them. Decision-makers therefore want to judge whether the intervention benefits patients based on all available information, not just the covariates included in the study.

### Research objectives

The main objective of the paper is to provide the tightest possible bounds on the Conditional Average Treatment Effect (CATE) when the target population has new covariates. Specifically, the paper introduces a partial identification strategy based on ideas from ecological inference: estimates of the CATE conditioned on the full covariate set must marginalize correctly when restricted to the covariates observed in both populations.
In addition, the paper proposes a bias-corrected estimator for these bounds and proves that it has a fast convergence rate and statistical guarantees (such as asymptotic normality).

### Method overview

1. **Problem setting**:
   - **Study population**: Covariates \( V \), treatment assignment \( A \in \{0, 1\} \), and outcome \( Y \) are observed.
   - **Target population**: Only covariates \( V \) and new covariates \( W \) are observed; neither treatment assignment nor outcome is observed.
2. **Partial identification bounds**:
   - Following ideas from ecological inference, the requirement that the fully conditional CATE marginalize correctly over the common covariates is used as a constraint.
   - A bias-corrected estimator is proposed that can use nonparametric and/or slowly converging machine-learning models to estimate nuisance functions without sacrificing the fast convergence rate.
3. **Bias-corrected estimator**:
   - Using sample splitting, the data set is divided into two parts: one is used to estimate the nuisance functions, and the other is used to construct the bias-corrected estimator.
   - It is proved that even when the nuisance functions converge at slower nonparametric rates, the estimator can still converge at a parametric rate.
4. **Sensitivity analysis model**:
   - A sensitivity analysis model is proposed that assumes the new covariates \( W \) have a limited impact on the treatment effect, which yields tighter bounds.

### Experimental results

- **Simulation experiments**: The bias-corrected estimator is shown to be more accurate than the plug-in estimator in most cases, especially when the error of the outcome regression model is large.
- **Application to real RCT data**: The method's effectiveness and robustness are demonstrated on real data.
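
The marginalization constraint from the method overview can be illustrated with a toy calculation. The sketch below is not the paper's estimator. It assumes a single binary new covariate \( W \), a known share \( p = P(W = 1 \mid V = v) \) in the target population, an already-estimated CATE \( \tau(v) \) on the common covariates, and a CATE known to lie in a fixed interval (for example \([-1, 1]\) for binary outcomes). Under those assumptions, \( \tau(v) = p\,\tau(v, 1) + (1 - p)\,\tau(v, 0) \) restricts each \( \tau(v, w) \) to an interval. The function name and arguments are hypothetical.

```python
def cate_bounds_binary_w(tau_v, p_w1, lo=-1.0, hi=1.0):
    """Bounds on tau(v, w) for binary W implied by marginalization.

    Assumptions (illustrative, not taken from the paper):
      tau_v = p_w1 * tau(v, 1) + (1 - p_w1) * tau(v, 0)
      lo <= tau(v, w) <= hi for w in {0, 1}
    Returns a dict mapping w to a (lower, upper) pair.
    """
    if not 0.0 < p_w1 < 1.0:
        raise ValueError("p_w1 must lie strictly between 0 and 1")
    if not lo <= tau_v <= hi:
        raise ValueError("tau_v must lie in [lo, hi]")

    p1, p0 = p_w1, 1.0 - p_w1
    # Solve the constraint for tau(v, 1), pushing tau(v, 0) to its extremes.
    lower1 = max(lo, (tau_v - p0 * hi) / p1)
    upper1 = min(hi, (tau_v - p0 * lo) / p1)
    # Same argument with the roles of the two subgroups swapped.
    lower0 = max(lo, (tau_v - p1 * hi) / p0)
    upper0 = min(hi, (tau_v - p1 * lo) / p0)
    return {1: (lower1, upper1), 0: (lower0, upper0)}


if __name__ == "__main__":
    # A large subgroup (80% of the stratum) is pinned down far more tightly
    # than a small one, which the constraint leaves nearly unrestricted.
    print(cate_bounds_binary_w(tau_v=0.6, p_w1=0.8))
```

With \( \tau(v) = 0.6 \) and \( p = 0.8 \), the majority subgroup's effect is confined to roughly \([0.5, 1]\), while the minority subgroup's interval stays close to the trivial \([-1, 1]\). The bounds are informative mainly where the new covariate splits a stratum unevenly or where \( \tau(v) \) sits near the edge of its range.
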
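
The sample-splitting idea behind the bias-corrected estimator can be sketched on a simpler, standard problem. The code below is not the paper's estimator for the bounds. It is a generic cross-fitted AIPW (doubly robust) estimate of an average treatment effect in a simulated randomized trial with known propensity 0.5, using linear least squares as a stand-in for the nuisance models. All function names and the simulated data-generating process are assumptions made for illustration.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta


def cross_fit_aipw(X, A, Y, propensity=0.5, seed=0):
    """Two-fold cross-fitted AIPW estimate of the ATE and its standard error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), 2)
    phi = np.empty(len(Y))
    for k in range(2):
        est, nuis = folds[k], folds[1 - k]
        # Nuisance step: outcome regressions are fit on the *other* fold only.
        treated, control = nuis[A[nuis] == 1], nuis[A[nuis] == 0]
        mu1 = fit_linear(X[treated], Y[treated])
        mu0 = fit_linear(X[control], Y[control])
        # Estimation step: plug-in difference plus a residual-based correction.
        m1, m0 = mu1(X[est]), mu0(X[est])
        a, y = A[est], Y[est]
        phi[est] = (m1 - m0
                    + a * (y - m1) / propensity
                    - (1 - a) * (y - m0) / (1 - propensity))
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(phi))


def simulate(n=4000, seed=1):
    """Simulated RCT with heterogeneous effect 2 + X[:, 1]; true ATE is 2."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    A = rng.integers(0, 2, size=n)
    Y = 1.0 + X[:, 0] + A * (2.0 + X[:, 1]) + rng.normal(size=n)
    return X, A, Y


if __name__ == "__main__":
    X, A, Y = simulate()
    ate, se = cross_fit_aipw(X, A, Y)
    print(f"ATE estimate: {ate:.3f} (SE {se:.3f})")
```

Fitting the nuisance models on one fold and evaluating the correction term on the other keeps the errors of the fitted models from being reused on the same observations. This separation is what lets estimators of this type tolerate slowly converging nuisance models.
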
### Conclusion

By introducing the partial identification strategy and the bias-corrected estimator, the paper addresses the problem of bounding the Conditional Average Treatment Effect when the target population has new covariates. Experiments on both real and synthetic data show that the framework can produce bounds that are much tighter than would otherwise be possible.