Causal Structural Hypothesis Testing and Data Generation Models

Jeffrey Jiang, Omead Pooladzandi, Sunay Bhat, Gregory Pottie
DOI: https://doi.org/10.48550/arXiv.2210.11275
2022-11-05
Abstract: A vast amount of expert and domain knowledge is captured by causal structural priors, yet there has been little research on testing such priors for generalization and data synthesis purposes. We propose a novel model architecture, Causal Structural Hypothesis Testing, that can use nonparametric, structural causal knowledge and approximate a causal model's functional relationships using deep neural networks. We use these architectures for comparing structural priors, akin to hypothesis testing, using a deliberate (non-random) split of training and testing data. Extensive simulations demonstrate the effectiveness of out-of-distribution generalization error as a proxy for causal structural prior hypothesis testing and offer a statistical baseline for interpreting results. We show that the variational version of the architecture, Causal Structural Variational Hypothesis Testing, can improve performance in low SNR regimes. Due to the simplicity and low parameter count of the models, practitioners can test and compare structural prior hypotheses on small datasets and use the priors with the best generalization capacity to synthesize much larger, causally-informed datasets. Finally, we validate our methods on a synthetic pendulum dataset, and show a use-case on a real-world trauma surgery ground-level falls dataset.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to use prior knowledge of causal structure to test competing structural hypotheses and to generate synthetic data. Specifically, the authors focus on how, when little data is available, to take nonparametric causal structural priors, approximate the causal model's functional relationships with deep neural networks, and compare structural hypotheses through these architectures, so that the hypothesis that generalizes best can be selected and used for data generation.

### Main problems

1. **Causal structure hypothesis testing**: How to test and compare different causal structure hypotheses on a limited dataset to determine which hypothesis is closest to the true causal structure.
2. **Data generation**: How to use the selected causal model to generate synthetic data that better reflects real-world causal relationships, especially in the out-of-distribution (OOD) case.

### Solutions

The authors propose two model architectures:

- **Causal Structural Hypothesis Testing (CSHT)**: A causal structure model for hypothesis testing, which evaluates the generalization ability of different causal hypotheses by deliberately (non-randomly) splitting training and test data.
- **Causal Structural Variational Hypothesis Testing (CSVHT)**: A variational version of CSHT, which improves performance under low signal-to-noise ratio (SNR) conditions and can generate more dynamic synthetic data.

### Key methods

- **Structural Causal Model (SCM)**: A binary SCM represents the causal relationships between variables.
- **Deep neural network**: Approximates the functional relationships in the causal model; the model is trained by minimizing the mean squared error (MSE).
- **Non-random data splitting**: Evaluate the generalization ability of the model by deliberately splitting the training and test data into non-overlapping distributions.

### Experimental verification

- **Simulated DAG experiment**: Verify the performance of CSHT and CSVHT under different hypotheses by simulating DAGs of different sizes and numbers of edges.
- **Physical pendulum dataset**: Use the synthetic pendulum dataset to verify the performance of the model under a known causal structure.
- **Medical trauma dataset**: Use a real-world dataset to verify the performance of the model under an unknown causal structure.

### Conclusions

- **Generalization ability**: CSHT and CSVHT show stronger generalization ability in out-of-distribution (OOD) tests, especially on small datasets.
- **Hypothesis testing**: By using the OOD loss as a proxy indicator, different causal hypotheses can be effectively distinguished, especially those hypotheses with missing paths.
- **Data generation**: CSVHT can generate high-quality synthetic data under low signal-to-noise ratio conditions, which is helpful for data augmentation and model training.

### Formulas

- **Structural causal matrix**:
  \[ S = A_{\text{DAG}} + D_{\text{diag}} \]
  where \( A_{\text{DAG}} \) is the DAG adjacency matrix and \( D_{\text{diag}} \) is a diagonal matrix. A value of 1 on the diagonal indicates an exogenous variable, and 0 indicates an endogenous variable.
- **Reconstruction loss**:
  \[ \ell_{\text{CSHT}} = \|\mathbf{x} - \eta_i(S_i \circ \mathbf{x})\|_2^2 \]
  where \( \eta_i \) is a fully-connected neural network used to approximate the relationship between parent and child nodes.
- **Structural Hamming distance**:
  \[ H = \|A_i - A_j\|_1 \]
  \[ H^+ = \|A_1 > A_0\|_1 \]
  \[ H^- = \|A_1 < A_0\|_1 \]

Through these methods and experiments, the paper demonstrates the effectiveness and potential of CSHT and CSVHT in causal hypothesis testing and data generation.
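
The structural causal matrix, the input masking inside the reconstruction loss, and the structural Hamming distances can be illustrated with a small NumPy sketch. This is not the authors' implementation. Several things here are assumptions made for illustration: the function names `structural_matrix` and `hamming_distances`, the convention that `adjacency[j, i] == 1` means an edge from node `j` to node `i` (so column `i` of \( S \) selects the inputs for node `i`), and the rule that a node with no parents is treated as exogenous. The three-node graphs are toy examples, not data from the paper.

```python
import numpy as np


def structural_matrix(adjacency):
    """Build S = A_DAG + D_diag from a binary DAG adjacency matrix.

    Assumed convention: adjacency[j, i] == 1 means j -> i.
    Assumed rule: a node with no parents is exogenous (diagonal entry 1).
    """
    A = np.asarray(adjacency, dtype=int)
    exogenous = (A.sum(axis=0) == 0).astype(int)  # 1 where a column has no parents
    return A + np.diag(exogenous)


def hamming_distances(A0, A1):
    """Return (H, H_plus, H_minus) between two binary adjacency matrices."""
    A0 = np.asarray(A0, dtype=int)
    A1 = np.asarray(A1, dtype=int)
    h = int(np.abs(A1 - A0).sum())   # entries where the two graphs differ
    h_plus = int((A1 > A0).sum())    # edges present in A1 but not in A0
    h_minus = int((A1 < A0).sum())   # edges present in A0 but not in A1
    return h, h_plus, h_minus


# Toy graph A0: chain 0 -> 1 -> 2.
A0 = np.array([[0, 1, 0],
               [0, 0, 1],
               [0, 0, 0]])
# Toy alternative A1: fork 0 -> 1 and 0 -> 2 (drops 1 -> 2, adds 0 -> 2).
A1 = np.array([[0, 1, 1],
               [0, 0, 0],
               [0, 0, 0]])

S0 = structural_matrix(A0)

# Masked input for node 2: only its parent (node 1) is passed through.
x = np.array([0.5, -1.0, 2.0])
masked_for_node_2 = S0[:, 2] * x

H, H_plus, H_minus = hamming_distances(A0, A1)
```

In this toy case the alternative graph differs from the chain in two entries, one added edge and one missing edge, so the two directional distances separate the kinds of error that the overall distance \( H \) lumps together.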