Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients

Marcel da Câmara Ribeiro-Dantas, Honghao Li, Vincent Cabeli, Louise Dupuis, Franck Simon, Liza Hettal, Anne-Sophie Hamy, Hervé Isambert
2023-03-11
Abstract: Discovering causal effects is at the core of scientific investigation but remains challenging when only observational data is available. In practice, causal networks are difficult to learn and interpret, and limited to relatively small datasets. We report a more reliable and scalable causal discovery method (iMIIC), based on a general mutual information supremum principle, which greatly improves the precision of inferred causal relations while distinguishing genuine causes from putative and latent causal effects. We showcase iMIIC on synthetic and real-life healthcare data from 396,179 breast cancer patients from the US Surveillance, Epidemiology, and End Results program. More than 90% of predicted causal effects appear correct, while the remaining unexpected direct and indirect causal effects can be interpreted in terms of diagnostic procedures, therapeutic timing, patient preference or socio-economic disparity. iMIIC's unique capabilities open up new avenues to discover reliable and interpretable causal networks across a range of research fields.
Quantitative Methods; Machine Learning; Data Analysis, Statistics and Probability; Molecular Networks; Methodology
What problem does this paper attempt to address?
The core problem this paper addresses is **reliably discovering causal relationships from large-scale observational data**. The researchers develop a new causal discovery method, iMIIC (interpretable Multivariate Information-based Inductive Causation), to overcome the limitations of existing causal discovery methods on large datasets and, in particular, to improve the reliability and interpretability of the inferred causal networks.

### Background of the Paper and Problem Description

1. **Importance of Causal Discovery**:
   - Discovering causal effects is at the core of scientific investigation, but it remains challenging when only observational data is available.
   - Causal networks are difficult to learn and interpret in practice, and existing methods are limited to relatively small datasets.
2. **Limitations of Existing Methods**:
   - Most structure learning methods cannot distinguish between causal and non-causal relationships.
   - Constraint-based methods lack robustness on finite datasets and are prone to error accumulation.
   - Other methods predict the orientation of edges less reliably than their presence, and cannot distinguish "latent" from "true" causal relationships.
3. **Research Objectives**:
   - Develop a more reliable and scalable causal discovery method that can handle very large datasets (for example, datasets with hundreds of thousands of samples).
   - Improve the precision of inferred causal relationships while distinguishing genuine causes from putative and latent causal effects.
   - Make the causal network more interpretable, especially when applied to healthcare data.

### Main Contributions of the iMIIC Method

1. **Improve the Reliability of Direction Prediction**:
   - Based on a general mutual information supremum principle, iMIIC predicts edge orientations more precisely and reduces the proportion of false-positive orientations.
2. **Distinguish between "True" and "Latent" Causal Relationships**:
   - By evaluating the probability of each orientation, iMIIC distinguishes "true" causal relationships from "latent" ones, so that associations driven by unobserved common causes are not mistaken for genuine causal effects.
3. **Ensure the Consistency of Indirect Paths**:
   - When removing redundant edges, iMIIC ensures that each separating set is consistent with the finally inferred graph structure, which improves the interpretability of indirect effects.
4. **Ability to Handle Large-Scale Datasets**:
   - iMIIC can handle datasets with hundreds of thousands of samples and is applicable to multiple research fields.

### Application Examples

The researchers applied iMIIC to a large breast cancer dataset from the Surveillance, Epidemiology, and End Results (SEER) program of the US National Cancer Institute, which contains the medical records of 396,179 patients. The results show that:

- More than 90% of the predicted causal effects appear correct.
- The remaining unexpected direct and indirect causal effects can be interpreted in terms of diagnostic procedures, therapeutic timing, patient preference, or socio-economic disparity.

### Conclusion

iMIIC discovers reliable and interpretable causal relationships in large-scale observational data, providing a new tool for research in medicine and other fields.
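
The removal of redundant edges mentioned under "Ensure the Consistency of Indirect Paths" rests on a general idea shared by information-based causal discovery: an edge between X and Y is a candidate for removal when their mutual information vanishes after conditioning on a separating set. The following is a minimal sketch of that idea only, not the iMIIC implementation, whose actual procedure is more involved and is not reproduced here. The function names, the simulated chain X → Z → Y, and the 10% flip probability are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in nats) of a list of hashable observations."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (
        entropy(list(zip(x, z)))
        + entropy(list(zip(y, z)))
        - entropy(list(zip(x, y, z)))
        - entropy(z)
    )


def noisy_copy(bits, flip_prob, rng):
    """Copy a binary sequence, flipping each bit with probability flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in bits]


# Hypothetical toy data: a causal chain X -> Z -> Y over binary variables.
rng = random.Random(0)
n = 20000
x = [rng.randint(0, 1) for _ in range(n)]
z = noisy_copy(x, 0.1, rng)  # Z depends directly on X
y = noisy_copy(z, 0.1, rng)  # Y depends on X only through Z

mi_xy = mutual_information(x, y)
cmi_xy_given_z = conditional_mutual_information(x, y, z)

print(f"I(X;Y)   = {mi_xy:.4f} nats")
print(f"I(X;Y|Z) = {cmi_xy_given_z:.4f} nats")
```

In this toy chain, I(X;Y) should come out clearly positive while I(X;Y|Z) should be close to zero, which is the signature of an indirect effect: the X–Y edge is redundant once Z is taken as the separating set. The plug-in estimates used here carry a small positive finite-sample bias, so a real method needs a principled criterion rather than a comparison with exactly zero.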