Abstract: It is often said that the fundamental problem of causal inference is a missing data problem -- the comparison of responses to two hypothetical treatment assignments is made difficult because for every experimental unit only one potential response is observed. In this paper, we consider the implications of the converse view: that missing data problems are a form of causal inference. We make explicit how the missing data problem of recovering the complete data law from the observed law can be viewed as identification of a joint distribution over counterfactual variables corresponding to values had we (possibly contrary to fact) been able to observe them. Drawing analogies with causal inference, we show how identification assumptions in missing data can be encoded in terms of graphical models defined over counterfactual and observed variables. We review recent results in missing data identification from this viewpoint. In doing so, we note interesting similarities and differences between missing data and causal identification theories.

What problem does this paper attempt to address?

The paper treats the missing data problem as a form of causal inference and studies how the complete data distribution can be identified from the observed data distribution. Specifically, it focuses on model identification under Missing Not At Random (MNAR) mechanisms, using counterfactual variables and causal graphical models.

### Main research questions:

1. **The relationship between missing data and causal inference**: The paper explores the analogy between the missing data problem and causal inference, pointing out similarities in terminology, identification theory, and statistical inference. In particular, it regards the missing data problem as a causal inference problem in which the variables being intervened on are the missingness indicators.
2. **Identification of Missing Not At Random (MNAR) models**: The paper discusses how target parameters can be identified under an MNAR mechanism by imposing independence restrictions. These restrictions can be represented by Directed Acyclic Graphs (DAGs), so that the identification problem reduces to reasoning about the structure of the graph.
3. **The application of graphical models in missing data identification**: The paper shows how DAGs encode independence restrictions on the complete data distribution and how target parameters can be identified from these graphs. In particular, it discusses how to achieve nonparametric identification in MNAR models without making parametric assumptions about the complete data distribution.
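As a rough illustration of how independence restrictions can make an MNAR target law recoverable, the sketch below uses a small two-variable model that is an assumption made here for illustration, not an example taken from the paper: the indicator `R1` depends only on the counterfactual `L2^(1)`, `R2` depends only on `L1^(1)`, and the two indicators are independent given the counterfactuals. All probabilities and function names are hypothetical. Under these restrictions, each propensity can be computed among rows where the other variable is observed, and dividing the complete-case law by their product recovers the target law exactly.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical target law p(l1^(1), l2^(1)) over two binary variables.
target = {(0, 0): F(1, 10), (0, 1): F(3, 10), (1, 0): F(2, 10), (1, 1): F(4, 10)}
# Hypothetical MNAR mechanism (an assumption for this sketch):
p_r1 = {0: F(1, 2), 1: F(4, 5)}  # p(R1 = 1 | l2^(1))
p_r2 = {0: F(3, 4), 1: F(2, 5)}  # p(R2 = 1 | l1^(1))


def bern(p, r):
    """Probability that a Bernoulli(p) indicator equals r."""
    return p if r == 1 else 1 - p


def observed_law(target):
    """Coarsen the full law: a value becomes None when its indicator is 0."""
    obs = {}
    for (l1, l2), p in target.items():
        for r1, r2 in product((0, 1), repeat=2):
            key = (l1 if r1 else None, l2 if r2 else None, r1, r2)
            obs[key] = obs.get(key, 0) + p * bern(p_r1[l2], r1) * bern(p_r2[l1], r2)
    return obs


def recover_target(obs):
    """Recover p(l1^(1), l2^(1)) using only quantities in the observed law."""
    recovered = {}
    for l1, l2 in product((0, 1), repeat=2):
        # p(R1 = 1 | L2 = l2, R2 = 1), which equals p(R1 = 1 | l2^(1)) in this model.
        both_l2 = sum(obs[(a, l2, 1, 1)] for a in (0, 1))
        prop1 = both_l2 / (both_l2 + obs[(None, l2, 0, 1)])
        # p(R2 = 1 | L1 = l1, R1 = 1), which equals p(R2 = 1 | l1^(1)) in this model.
        both_l1 = sum(obs[(l1, b, 1, 1)] for b in (0, 1))
        prop2 = both_l1 / (both_l1 + obs[(l1, None, 1, 0)])
        recovered[(l1, l2)] = obs[(l1, l2, 1, 1)] / (prop1 * prop2)
    return recovered


obs = observed_law(target)
recovered = recover_target(obs)

# Naive complete-case law, for comparison: renormalise the fully observed rows.
total_cc = sum(obs[(a, b, 1, 1)] for a, b in product((0, 1), repeat=2))
complete_case = {(a, b): obs[(a, b, 1, 1)] / total_cc for a, b in product((0, 1), repeat=2)}
```

Exact fractions are used instead of simulated samples so that the comparison does not depend on sampling noise: `recovered` matches `target`, while `complete_case` does not.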
### Main contributions of the paper:

- **Theoretical framework**: The paper provides a theoretical framework that unifies the missing data problem with the causal inference problem and restates classical missing data models in terms of counterfactual variables and causal graphical models.
- **Identification methods**: The paper proposes several new techniques for nonparametric identification in missing data DAG models, which can handle complex scenarios including unobserved confounding factors.
- **Practical applications**: The paper discusses the potential value of these theories and methods in practice, especially for complex missing data problems.

### Key concepts and formulas:

- **Counterfactual variable**: \( L_k^{(r_k = 1)} \), abbreviated \( L_k^{(1)} \), denotes the value the \( k \)-th variable would take had its missingness indicator \( R_k \) been set to 1, that is, had the variable been observed.
- **Target law**: \( p(l^{(1)}) \), the joint distribution of the complete data.
- **Missingness mechanism**: \( p(r \mid l^{(1)}) \), the conditional distribution of the missingness indicators \( R \) given the complete data.
- **Full law**: \( p(l^{(1)}, r) \), the joint distribution of the complete data and the missingness indicators.
- **g-formula**: \( p(l(a)) = \frac{p(l, a)}{p(a \mid l(a))} \), used for identifying causal effects.

Through these theories and methods, the paper provides new perspectives and tools for dealing with complex missing data problems.
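The g-formula identity listed above can be read off from two elementary steps, sketched here under the usual consistency and positivity assumptions; the missing data analogue in the last line simply replaces the treatment \( A = a \) with the indicator \( R = 1 \). This is a short worked derivation for orientation, not a reproduction of the paper's argument.

```latex
\begin{align*}
p(L(a) = l)
  &= \frac{p(L(a) = l,\, A = a)}{p(A = a \mid L(a) = l)}
  && \text{(chain rule, assuming the denominator is positive)} \\
  &= \frac{p(L = l,\, A = a)}{p(A = a \mid L(a) = l)}
  && \text{(consistency: } L = L(a) \text{ whenever } A = a\text{)} \\[6pt]
p(l^{(1)})
  &= \frac{p(l,\, R = 1)}{p(R = 1 \mid l^{(1)})}
  && \text{(missing data analogue, with } L = L^{(1)} \text{ whenever } R = 1\text{)}
\end{align*}
```

The numerators are functions of the observed law, so in both settings identification hinges on whether the denominator, which conditions on counterfactuals, can also be expressed in terms of observed quantities.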

Causal and counterfactual views of missing data models

Causal Inference: A Missing Data Perspective

Causal modelling without introducing counterfactuals or abstract distributions

Neural Causal Models for Counterfactual Identification and Estimation

Graphical Models of Entangled Missingness

Causal Inference with Unmeasured Confounding from Nonignorable Missing Outcomes

Rethinking the framework constructed by counterfactual functional model

Estimating complex causal effects from incomplete observational data

Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks

Reinterpreting causal discovery as the task of predicting unobserved joint statistics

Nonparametric causal inference with confounders missing not at random

What can be estimated? Identifiability, estimability, causal inference and ill-posed inverse problems

Causal Razors

A Quantum Probability Model of Causal Reasoning

Causal models on probability spaces

Estimating Categorical Counterfactuals via Deep Twin Networks

Nondeterministic Causal Models

The Impact of Missing Data on Causal Discovery: A Multicentric Clinical Study

Omitted Labels in Causality: A Study of Paradoxes

Causal Discovery in Linear Models with Unobserved Variables and Measurement Error

Causal Inference with Non-IID Data under Model Uncertainty