The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications

Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad P. Kording, Karen Sachs, Alexandre Drouin, Dhanya Sridhar
2024-12-03
Abstract: Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientific disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.
Machine Learning, Methodology
What problem does this paper attempt to address?
The paper addresses the limited practical applicability of causal discovery. Current causal discovery methods often rely on unrealistic assumptions and are evaluated mainly on simple synthetic datasets with inadequate evaluation metrics. As a result, they fall short in real-world applications because they do not account for the complexity and diversity of real-world data. Through a systematic review of the recent causal discovery literature, the paper documents these shortcomings in evaluation and application, and stresses the importance of more realistic evaluation datasets and more appropriate evaluation metrics.

### Main Objectives of the Paper:

1. **Evaluate the Current Situation**: Show, through a systematic review, that current causal discovery research still relies mainly on synthetic datasets, that the datasets used lack diversity, and that the evaluation metrics are inadequate.
2. **Propose Improvements**: Encourage the research community to adopt datasets and evaluation metrics that are closer to practical applications, so as to promote the use of causal discovery methods in the real world.
3. **Identify Application Areas**: Focus on biology, neuroscience, and Earth sciences. These fields generate large amounts of real-world data and are well suited to causal discovery research.

### Main Contributions:

- **Systematic Review**: A systematic review of the recent causal discovery literature that reveals the shortcomings of how existing methods are evaluated.
- **Dataset Analysis**: A detailed analysis of the different types of causal discovery datasets (synthetic, pseudo-real, and real-world), with their advantages and limitations.
- **Application Cases**: Specific datasets and application cases from biology, neuroscience, and Earth sciences, showing both the potential and the challenges of causal discovery in these fields.

### Key Issues:

- **Limitations of Synthetic Datasets**: Synthetic datasets make it easy to control variables and verify hypotheses, but they differ substantially from real-world data, so evaluation results obtained on them may be overly optimistic.
- **Challenges of Real-World Datasets**: Real-world datasets usually lack a known causal structure, which makes evaluation more complex, but they better reflect practical application scenarios.
- **Selection of Evaluation Metrics**: Existing evaluations focus mainly on structural metrics and neglect interventional metrics, which better reflect the needs of practical applications.

### Conclusion:

The paper calls on the causal discovery research community to pay more attention to practical applications and to adopt more diverse datasets and more comprehensive evaluation metrics, so as to promote the effective use of causal discovery methods in the real world.
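
To make the notion of a structural metric mentioned under Key Issues more concrete, here is a minimal Python sketch of the structural Hamming distance (SHD), one commonly used structural metric. The summary does not name a specific metric, so the choice of SHD, the function name `structural_hamming_distance`, the adjacency-matrix encoding, and the example graphs are all assumptions made for illustration. The sketch also assumes the convention that a reversed edge counts as a single error; other conventions count it as two.

```python
def structural_hamming_distance(true_adj, pred_adj):
    """Count the node pairs whose edge status differs between two graphs.

    Both arguments are square adjacency matrices given as lists of lists,
    where adj[i][j] == 1 encodes a directed edge i -> j.
    A missing, extra, or reversed edge each counts as one error
    (an assumed convention; some tools count a reversal as two).
    """
    n = len(true_adj)
    distance = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Compare both directions of the pair (i, j) at once.
            true_pair = (true_adj[i][j], true_adj[j][i])
            pred_pair = (pred_adj[i][j], pred_adj[j][i])
            if true_pair != pred_pair:
                distance += 1
    return distance


# Hypothetical ground-truth graph: 0 -> 1 -> 2
true_graph = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

# Hypothetical estimate: 0 -> 1 is correct, 1 -> 2 is reversed, 0 -> 2 is extra
estimated_graph = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]

shd = structural_hamming_distance(true_graph, estimated_graph)
print(shd)  # one reversed edge plus one extra edge
```

A metric of this kind only scores how closely the estimated graph matches a reference graph. It says nothing about how well the estimate predicts the effect of an intervention, which is the gap the summary attributes to structural metrics.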