Causal and counterfactual views of missing data models

Razieh Nabi, Rohit Bhattacharya, Ilya Shpitser, James Robins
2024-09-30
Abstract: It is often said that the fundamental problem of causal inference is a missing data problem -- the comparison of responses to two hypothetical treatment assignments is made difficult because for every experimental unit only one potential response is observed. In this paper, we consider the implications of the converse view: that missing data problems are a form of causal inference. We make explicit how the missing data problem of recovering the complete data law from the observed law can be viewed as identification of a joint distribution over counterfactual variables corresponding to values had we (possibly contrary to fact) been able to observe them. Drawing analogies with causal inference, we show how identification assumptions in missing data can be encoded in terms of graphical models defined over counterfactual and observed variables. We review recent results in missing data identification from this viewpoint. In doing so, we note interesting similarities and differences between missing data and causal identification theories.
Methodology, Machine Learning
What problem does this paper attempt to address?
This paper treats the missing data problem as a form of causal inference and studies how to identify the complete data distribution from the observed data distribution. Specifically, it focuses on model identification under Missing Not At Random (MNAR) mechanisms, approached through counterfactual variables and causal graphical models.

### Main research questions:

1. **The relationship between missing data and causal inference**: The paper explores the analogy between missing data problems and causal inference, pointing out similarities in terminology, identification theory, and statistical inference. In particular, it treats a missing data problem as a causal inference problem in which the variables intervened on are the missingness indicators.
2. **Identification of Missing Not At Random (MNAR) models**: The paper discusses how target parameters can be identified under MNAR mechanisms by imposing independence restrictions. These restrictions can be represented by Directed Acyclic Graphs (DAGs), which turns a complex identification problem into an analysis of paths in the graph.
3. **The application of graphical models in missing data identification**: The paper shows how DAGs encode independence restrictions on the complete data distribution and how target parameters can be identified from these graphs. In particular, it discusses how non-parametric identification can be achieved in MNAR models through graphical models, without parametric assumptions on the complete data distribution.
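As a sketch of why the independence restrictions matter, the complete data distribution can be written in terms of the observed distribution and the missingness propensity. The notation is an assumption made for this illustration: \( L^{(1)} \) denotes the counterfactual complete data, \( R \) the missingness indicators, and \( L \) the observed variables. Assuming consistency (\( L = L^{(1)} \) whenever \( R = 1 \)) and positivity (\( p(R = 1 \mid l^{(1)}) > 0 \)):

```latex
\begin{align*}
p(l^{(1)}) &= \frac{p(l^{(1)}, R = 1)}{p(R = 1 \mid l^{(1)})} && \text{(chain rule, positivity)} \\
           &= \frac{p(l, R = 1)}{p(R = 1 \mid l^{(1)})}       && \text{(consistency)}
\end{align*}
```

The numerator is a function of the observed distribution alone, so the complete data distribution is identified once the propensity \( p(R = 1 \mid l^{(1)}) \) is. Under MNAR the propensity conditions on values that may be missing, which is why additional independence restrictions, such as those read off a DAG, are needed to express it in terms of observed quantities.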
### Main contributions of the paper:

- **Theoretical framework**: The paper provides a framework that unifies missing data problems with causal inference and recasts classical missing data models in terms of counterfactual variables and causal graphical models.
- **Identification methods**: The paper presents several techniques for non-parametric identification in missing data DAG models, including ones that handle complex scenarios with unobserved confounders.
- **Practical applications**: The paper discusses the potential value of these theories and methods in practice, especially for complex missing data problems.

### Key concepts and formulas:

- **Counterfactual variable**: \( L_k^{(r_k = 1)} \), abbreviated \( L_k^{(1)} \), denotes the value the \( k \)-th variable would take if its missingness indicator \( R_k \) were set to 1, that is, had the variable been observed.
- **Target law**: \( p(l^{(1)}) \), the joint distribution of the complete data.
- **Missingness mechanism**: \( p(r \mid l^{(1)}) \), the conditional distribution of the missingness indicators \( R \) given the complete data.
- **Full law**: \( p(l^{(1)}, r) \), the joint distribution of the complete data and the missingness indicators.
- **g-formula**: \( p(l^{(a)}) = \frac{p(l, a)}{p(a \mid l^{(a)})} \), used for identifying causal effects.

Through these theories and methods, the paper provides new perspectives and tools for dealing with complex missing data problems.
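The relationship between the target law, the missingness mechanism, and the full law can be illustrated with a small exact calculation. The sketch below is not taken from the paper: the two-variable model, the probability tables, and the names `target`, `p_r1`, `p_r2`, and `recover` are all hypothetical. It assumes a simple MNAR structure in which each missingness indicator depends only on the counterfactual value of the *other* variable, and in which the two indicators are independent given the complete data.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical target law p(l1^(1), l2^(1)) over two binary variables.
target = {(0, 0): F(3, 10), (0, 1): F(2, 10), (1, 0): F(1, 10), (1, 1): F(4, 10)}

# Hypothetical MNAR mechanism: each indicator depends on the *other* variable's
# counterfactual value, so missingness depends on values that may be missing.
p_r1 = {0: F(4, 5), 1: F(1, 2)}    # p(R1 = 1 | l2^(1))
p_r2 = {0: F(3, 5), 1: F(9, 10)}   # p(R2 = 1 | l1^(1))


def bern(p, r):
    """Probability that a Bernoulli(p) indicator equals r."""
    return p if r == 1 else 1 - p


# Full law p(l^(1), r) = target law * missingness mechanism.
full = {
    (l1, l2, r1, r2): target[(l1, l2)] * bern(p_r1[l2], r1) * bern(p_r2[l1], r2)
    for (l1, l2), r1, r2 in product(target, (0, 1), (0, 1))
}

# Observed law: a value is visible only when its indicator is 1 (None otherwise).
observed = {}
for (l1, l2, r1, r2), p in full.items():
    key = (l1 if r1 else None, l2 if r2 else None, r1, r2)
    observed[key] = observed.get(key, F(0)) + p


def prob(event):
    """Probability of an event, computed from the observed law only."""
    return sum((p for key, p in observed.items() if event(key)), F(0))


def recover(l1, l2):
    """Identify p(l1^(1), l2^(1)) from the observed law."""
    # p(R1 = 1 | L2 = l2, R2 = 1), which equals p(R1 = 1 | l2^(1)) in this model.
    w1 = prob(lambda k: k[1] == l2 and k[2] == 1 and k[3] == 1) / prob(
        lambda k: k[1] == l2 and k[3] == 1
    )
    # p(R2 = 1 | L1 = l1, R1 = 1), which equals p(R2 = 1 | l1^(1)) in this model.
    w2 = prob(lambda k: k[0] == l1 and k[2] == 1 and k[3] == 1) / prob(
        lambda k: k[0] == l1 and k[2] == 1
    )
    # Complete-case probability divided by the identified propensity.
    return observed[(l1, l2, 1, 1)] / (w1 * w2)


recovered = {cell: recover(*cell) for cell in target}
print(recovered == target)
```

Exact fractions are used so that the comparison between `recovered` and `target` is free of rounding error. The `recover` function reads only from `observed`, never from `target` or `full`, which is the point of the exercise: the independence restrictions assumed for this toy model let the propensity be rewritten in terms of observed quantities, even though the mechanism is not Missing At Random.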