To Impute or not to Impute? Missing Data in Treatment Effect Estimation

Jeroen Berrevoets, Fergus Imrie, Trent Kyono, James Jordon, Mihaela van der Schaar
DOI: https://doi.org/10.48550/arXiv.2202.02096
2023-02-24
Abstract: Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the input (e.g. an individual) and the label (e.g. an outcome). The treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we introduce mixed confounded missingness (MCM), a new missingness mechanism where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment introduces bias in covariates. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data. We highlight that our experiments encompass both average treatment effects and conditional average treatment effects.
Machine Learning
What problem does this paper attempt to address?
The paper addresses how to estimate treatment effects accurately when data are missing. In treatment effect estimation, treatment selection and the missingness of certain variables interact, so the standard ways of handling missing data (imputing everything or imputing nothing) do not yield unbiased estimates. The paper identifies three issues:

1. **The influence of missing data on treatment selection**: Whether a variable is missing can affect the choice of treatment, thereby introducing selection bias. For example, in a medical scenario, doctors may decide whether to adopt a more aggressive treatment plan based on how complete the patient's information is.
2. **The influence of treatment selection on data missingness**: The choice of treatment can also cause certain variables to be missing. For example, some drugs require baseline blood tests before treatment; if a different treatment plan is selected, these tests may not be carried out.
3. **The deficiencies of existing missingness mechanisms**: Existing mechanisms (MCAR, MAR, MNAR) do not account for the treatment variable and its influence on the missingness pattern, so assumptions based on them are insufficient for treatment effect estimation.

To address these problems, the paper introduces a new missingness mechanism, mixed confounded missingness (MCM). Under MCM, some missingness determines treatment selection and other missingness is determined by treatment selection. Based on this mechanism, the paper proposes selective imputation: impute only the variables whose missingness is caused by treatment, and leave unimputed the variables whose missingness influences treatment selection. This retains the information needed for unbiased estimation while reducing the bias that treatment-induced missingness introduces into the covariates.
Through this method, the paper aims to provide a more accurate and reliable way to handle treatment effect estimation in the presence of missing data.
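
To make the idea of selective imputation concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the two groups of columns are already known (for example from domain knowledge), uses mean imputation purely as a stand-in for any imputer, and represents unimputed variables with a placeholder value plus a missingness indicator. The function name `selective_impute` and its parameters are hypothetical.

```python
import numpy as np


def selective_impute(X, impute_cols, keep_missing_cols, fill_value=0.0):
    """Impute only `impute_cols`; preserve missingness of `keep_missing_cols`.

    X: 2-D float array with np.nan marking missing entries.
    impute_cols: columns whose missingness is assumed to be caused by treatment.
    keep_missing_cols: columns whose missingness is assumed to influence treatment.
    (Both groupings are assumptions supplied by the caller.)
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()

    # Mean imputation is only a stand-in; any imputer could be used here.
    for j in impute_cols:
        col = out[:, j]  # view into `out`, so assignment edits `out` in place
        missing = np.isnan(col)
        if missing.all():
            raise ValueError(f"column {j} has no observed values to impute from")
        col[missing] = col[~missing].mean()

    # Keep the missingness itself as explicit 0/1 indicator features.
    masks = np.isnan(X[:, keep_missing_cols]).astype(float)

    # Replace NaN with a placeholder so learners that reject NaN can still run;
    # the indicator columns carry the information that the value was missing.
    for j in keep_missing_cols:
        col = out[:, j]
        col[np.isnan(col)] = fill_value

    return np.hstack([out, masks])


X = np.array([
    [1.0, np.nan, 5.0],
    [3.0, 2.0, np.nan],
    [np.nan, 4.0, 7.0],
])
# Column 0: missingness assumed caused by treatment (imputed).
# Columns 1 and 2: missingness assumed to drive treatment (kept as indicators).
Z = selective_impute(X, impute_cols=[0], keep_missing_cols=[1, 2])
print(Z)
```

The returned array holds the original columns followed by one indicator column per entry in `keep_missing_cols`, so a downstream treatment effect learner can still see which values were missing where that missingness may have driven treatment selection.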