Causal discovery for observational sciences using supervised machine learning

Anne Helby Petersen, Joseph Ramsey, Claus Thorn Ekstrøm, Peter Spirtes

DOI: https://doi.org/10.48550/arXiv.2202.12813

2022-05-14

Abstract: Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: A causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.

Methodology, Machine Learning

What problem does this paper attempt to address?

This paper addresses shortcomings of existing causal discovery methods on small samples and dense graph structures. Specifically:

1. **Poor performance on small samples**: Existing causal discovery algorithms typically run a long sequence of statistical tests in which each result determines which tests are run next. A single statistical error in an early test can therefore propagate and substantially affect the final model, and such errors are more likely when the sample is small.
2. **Bias towards sparse graphs**: Most existing algorithms focus on sparse causal models, that is, models with few causal connections between variables. This may not be realistic in many applications, especially in fields such as epidemiology, where causal relationships between variables are often complex and dense, and these algorithms perform poorly on dense graph structures.
3. **Error tradeoff**: Existing methods are more reliable when they claim that a causal relationship exists than when they claim that one is absent. In observational sciences such as epidemiology, wrongly assuming that a causal relationship does not exist imposes overly strong assumptions on the subsequent analysis and can lead to biased estimates of causal effects.

To address these problems, the paper proposes a new causal discovery method, Supervised Learning Discovery (SLdisco). SLdisco uses supervised machine learning to map observational data to an equivalence class of causal models, represented as a completed partially directed acyclic graph (CPDAG). It aims to be more robust on small samples, to avoid a preference for sparse graphs, and to give direct control over the tradeoff between the different types of errors.

In the paper's evaluation, SLdisco is more conservative about claiming that causal relationships are absent, only moderately less informative, and less sensitive to sample size than existing procedures, which makes it better suited to the needs of observational sciences.
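To make the supervised-learning framing concrete, the following is a minimal sketch of how one simulated training example could be produced and how a cutoff on predicted edge scores could shift the error tradeoff. It is not the authors' implementation. The function names, the use of a correlation matrix as model input, the edge-weight range, and the threshold `tau` are all assumptions made for illustration, and the target here is the undirected skeleton of the simulated DAG, which is a simplification of the CPDAG target described above.

```python
import numpy as np


def simulate_training_pair(n_vars=5, n_obs=50, edge_prob=0.4, rng=None):
    """Simulate one (features, target) pair from a random linear Gaussian DAG.

    Hypothetical helper: features are the sample correlation matrix,
    target is the undirected skeleton of the DAG (not the full CPDAG).
    """
    rng = np.random.default_rng(rng)
    # Random DAG as a strictly upper-triangular adjacency matrix;
    # adj[i, j] == 1 means an edge i -> j.
    adj = np.triu(rng.random((n_vars, n_vars)) < edge_prob, k=1).astype(int)
    signs = rng.choice([-1.0, 1.0], size=adj.shape)
    weights = adj * rng.uniform(0.5, 1.5, size=adj.shape) * signs
    # Linear Gaussian structural equations, filled in topological order:
    # X_j = sum_i weights[i, j] * X_i + standard normal noise.
    data = np.zeros((n_obs, n_vars))
    for j in range(n_vars):
        data[:, j] = data @ weights[:, j] + rng.normal(size=n_obs)
    features = np.corrcoef(data, rowvar=False)
    skeleton = adj + adj.T
    return features, skeleton


def threshold_edges(edge_scores, tau):
    """Keep an edge when its (symmetrised) score is at least tau.

    A lower tau keeps more edges, i.e. makes fewer claims of absence.
    """
    sym = np.maximum(edge_scores, edge_scores.T)
    kept = (sym >= tau).astype(int)
    np.fill_diagonal(kept, 0)
    return kept
```

Many such pairs, generated across a range of graph densities and sample sizes, could serve as training data for any supervised model that outputs one score per variable pair. Choosing a lower `tau` at prediction time gives a denser, more conservative graph, at the cost of being less informative.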
