Causal inference for multiple risk factors and diseases from genomics data

Nick Machnik, Mahdi Mahmoudi, Malgorzata Borczyk, Ilse Kraetschmer, Markus J Bauer, Matthew R Robinson
DOI: https://doi.org/10.1101/2023.12.06.570392
2024-08-10
Abstract: Statistical causal learning in genomics relies on the instrumental variable method of Mendelian Randomization (MR). Currently, an overwhelming number of MR studies purport to show causal relationships among a wide range of risk factors and outcomes. Here, we show that selecting instrument variables from genome-wide association study estimates leads to high false discovery rates for many MR approaches, which can be greatly reduced by employing a graphical inference approach which: (i) explicitly tests instrumental variable assumptions; (ii) distinguishes direct from indirect factors in very high-dimensional data; (iii) discriminates pleiotropic from trait-specific markers, controlling for LD genome-wide; (iv) accommodates rare variants and binary outcomes in a principled way; and (v) identifies potential unobserved latent confounding. For 17 traits and 8.4M variants recorded for 458,747 individuals in the UK Biobank, we show that standard MR analysis gives an abundance of findings that disappear under stringent assumption checks, with many relationships reflecting potential unmeasured confounding. This implies that mixtures of temporal precedence and potential for reverse-causality prohibit understanding the underlying nature of phenotypic and genetic correlations in biobank data. We propose that well-curated longitudinal records are likely needed and that our approach provides a first-step toward robust principled screening for potential causal links.
Genetics
What problem does this paper attempt to address?
The paper attempts to address the following issues:

1. **High False Discovery Rate**: When instrumental variables (IVs) are selected from genome-wide association study (GWAS) estimates, many existing Mendelian Randomization (MR) methods produce a high false discovery rate. The paper points out that many MR studies claim to have found widespread causal relationships, but these findings often disappear under stringent checks of the IV assumptions.
2. **Complex Causal Relationship Identification**: Existing methods struggle to distinguish direct from indirect causal factors and to handle highly correlated risk factors and genetic markers. Statistical causal discovery in large-scale genomic data is also difficult because of the large number of potential risk factors and outcomes and the millions of genetic markers involved.
3. **Potential Confounding Factors**: Many existing methods fail to identify and control for latent confounding, so their causal estimates may reflect unmeasured confounders rather than causal effects.

To address these issues, the authors propose a new method, CI-GWAS (Causal Inference for Genome-Wide Association Studies), which constructs large-scale graphical models describing the causal relationships among genetic markers, risk factors, and disease outcomes. Through graphical inference, the method aims to select instrumental variables that satisfy the MR assumptions, thereby improving the accuracy and reliability of causal inference.
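
The instrumental-variable logic behind MR, and the way an invalid instrument can produce a false discovery, can be illustrated with a small simulation. The sketch below is not CI-GWAS and is not taken from the paper: it uses a textbook single-instrument Wald ratio estimator, and every name and number in it (`simulate`, `slope`, `wald_ratio`, the allele frequency 0.3, the effect sizes) is a hypothetical choice made for illustration. It assumes one variant `g`, one exposure `x`, one outcome `y`, and an unmeasured confounder that affects both `x` and `y`.

```python
import numpy as np


def simulate(n, causal_effect, pleiotropic_effect, seed):
    """Simulate one variant g, exposure x, and outcome y (all values hypothetical).

    An unmeasured confounder u affects both x and y. A non-zero
    pleiotropic_effect gives g a direct path to y that bypasses x,
    which violates the exclusion restriction of a valid instrument.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n).astype(float)  # genotype coded 0/1/2
    u = rng.normal(size=n)                          # unmeasured confounder
    x = 0.5 * g + u + rng.normal(size=n)            # exposure
    y = causal_effect * x + pleiotropic_effect * g + u + rng.normal(size=n)
    return g, x, y


def slope(a, b):
    """Ordinary least-squares slope of b regressed on a (with intercept)."""
    c = np.cov(a, b)
    return c[0, 1] / c[0, 0]


def wald_ratio(g, x, y):
    """Single-instrument MR estimate: (effect of g on y) / (effect of g on x)."""
    return slope(g, y) / slope(g, x)


if __name__ == "__main__":
    n = 200_000

    # Valid instrument: g affects y only through x, true causal effect 0.2.
    g, x, y = simulate(n, causal_effect=0.2, pleiotropic_effect=0.0, seed=1)
    print("naive regression of y on x:", round(slope(x, y), 3))
    print("Wald ratio, valid instrument:", round(wald_ratio(g, x, y), 3))

    # Pleiotropic instrument: x has no effect on y, but g acts on y directly.
    g, x, y = simulate(n, causal_effect=0.0, pleiotropic_effect=0.3, seed=2)
    print("Wald ratio, pleiotropic instrument:", round(wald_ratio(g, x, y), 3))
```

Under these assumptions the naive regression is inflated by the confounder, while the Wald ratio with a valid instrument should land close to the simulated effect of 0.2. In the second scenario the exposure has no effect on the outcome, yet the Wald ratio should land close to 0.3 / 0.5 = 0.6, a spurious "causal" signal created entirely by pleiotropy. This is the kind of assumption violation that the summary above says needs to be tested explicitly rather than assumed away.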