Causal discovery for observational sciences using supervised machine learning

Anne Helby Petersen, Joseph Ramsey, Claus Thorn Ekstrøm, Peter Spirtes
DOI: https://doi.org/10.48550/arXiv.2202.12813
2022-05-14
Abstract: Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: A causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.
Methodology, Machine Learning
What problem does this paper attempt to address?
The paper targets three shortcomings of existing causal discovery methods, which matter most on small samples and dense graph structures:

1. **Poor performance on small samples**: Existing causal discovery algorithms give unsatisfactory results on small samples because statistical errors propagate. These algorithms usually run a large number of sequential tests, and the result of each test affects the tests that follow, so even a small statistical error in an early test can have a significant impact on the final result.
2. **Bias towards sparse graphs**: Most existing causal discovery algorithms assume that causal relationships are sparse, that is, that there are few causal connections between variables. In many practical applications this assumption may not hold, especially in fields such as epidemiology, where the causal relationships between variables are often complex and dense. These algorithms therefore perform poorly on dense graph structures.
3. **Non-conservative error tradeoff**: Existing causal discovery methods are more reliable when they claim that a causal relationship exists than when they claim that one is absent. In observational sciences such as epidemiology, wrongly assuming that certain causal relationships do not exist may lead to inappropriate statistical analyses and biased estimates of causal effects.

To address these problems, the paper proposes a new causal discovery method, Supervised Learning Discovery (SLdisco). SLdisco uses supervised machine learning to infer the equivalence class (CPDAG) of causal models from observational data. It aims to improve robustness on small samples, avoid the preference for sparse graphs, and provide a way to directly control the different types of errors.
In the paper's evaluation, SLdisco is more conservative about declaring causal relationships absent, only moderately less informative, and less sensitive to sample size than existing procedures, which makes it better suited to the needs of observational sciences.
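The supervised setup described above can be illustrated with a minimal sketch of how training examples might be simulated. This is not the authors' implementation. Several choices are assumptions made only for illustration: the sample correlation matrix as the model input, the undirected skeleton (rather than a full CPDAG) as the label, the hypothetical function names, and all parameter values. The final helper shows how a cutoff on predicted edge scores could trade one error type against the other.

```python
import math
import random


def random_dag(p, edge_prob, rng):
    """Weight matrix of a random DAG: W[i][j] != 0 means i -> j, only for i < j."""
    W = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() < edge_prob:
                W[i][j] = rng.choice([-1, 1]) * rng.uniform(0.5, 1.5)
    return W


def simulate(W, n, rng):
    """Draw n rows from the linear Gaussian model X_j = sum_i W[i][j] * X_i + noise."""
    p = len(W)
    rows = []
    for _ in range(n):
        x = [0.0] * p
        for j in range(p):  # 0..p-1 is a topological order by construction
            x[j] = sum(W[i][j] * x[i] for i in range(j)) + rng.gauss(0.0, 1.0)
        rows.append(x)
    return rows


def correlation_matrix(rows):
    """Sample correlation matrix of the simulated data (assumed model input)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[a][b] / (sd[a] * sd[b]) for b in range(p)] for a in range(p)]


def skeleton(W):
    """Symmetric 0/1 adjacency matrix: 1 where the DAG has an edge in either direction."""
    p = len(W)
    return [[1 if (W[i][j] != 0 or W[j][i] != 0) else 0 for j in range(p)]
            for i in range(p)]


def make_training_pair(p, n, edge_prob, rng):
    """One (input, label) example: correlation matrix and true skeleton."""
    W = random_dag(p, edge_prob, rng)
    return correlation_matrix(simulate(W, n, rng)), skeleton(W)


def threshold_edges(scores, cutoff):
    """Keep an edge when its score reaches the cutoff.

    A lower cutoff yields a denser graph, i.e. fewer claims that a relation is absent.
    """
    p = len(scores)
    return [[1 if i != j and scores[i][j] >= cutoff else 0 for j in range(p)]
            for i in range(p)]


rng = random.Random(0)
# Small samples (n=50) and fairly dense graphs (edge_prob=0.4) are illustrative values.
pairs = [make_training_pair(p=5, n=50, edge_prob=0.4, rng=rng) for _ in range(100)]
```

Any supervised learner could then be fit on `pairs`. Because such a learner outputs a score per potential edge, choosing the cutoff passed to `threshold_edges` is one simple way to make the result more or less conservative about missing edges.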