Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning

Ashka Shah, Adela DePavia, Nathaniel Hudson, Ian Foster, Rick Stevens
2024-07-25
Abstract: The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way -- without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure -- a set of learned or existing candidate hypotheses -- to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
Machine Learning; Distributed, Parallel, and Cluster Computing; Methodology
What problem does this paper attempt to address?
This paper addresses causal discovery in high-dimensional structured hypothesis spaces. Specifically, it focuses on how to infer the causal network underlying the data when the number of variables is very large. Traditional causal discovery algorithms become computationally intractable on such problems, which makes the search over causal graphs infeasible. To address this, the paper proposes a new method, causal graph partitioning.

### Main contributions of the paper

1. **Defined causal partitioning**:
   - Proposes a new causal graph partition that decomposes a high-dimensional problem into multiple smaller sub-problems.
   - The partition is built from a superstructure, a set of learned or existing candidate hypotheses, which is used to divide the variable set into overlapping subsets and so enable a divide-and-conquer strategy.
2. **Theoretical guarantees**:
   - Proves that, under certain assumptions, learning with a causal partition always yields the Markov Equivalence Class (MEC) of the true causal graph.
   - This means that, in the limit of infinite data, the method recovers a graph that is Markov equivalent to the true causal graph.
3. **Algorithm performance**:
   - The experimental results show that the algorithm achieves comparable accuracy and a faster time to solution on biologically-tuned synthetic networks, and that it scales to networks with up to 10,000 variables.
   - This makes the method applicable to gene regulatory network inference and other problems with high-dimensional structured hypothesis spaces.

### Background and related work

- **Causal discovery**:
  - The goal of causal discovery is to infer the causal relationships between variables from observational data. These relationships are usually represented by Directed Acyclic Graphs (DAGs), where nodes represent random variables and directed edges represent causal relationships.
  - Searching the space of causal graphs is NP-hard, so efficient algorithms are required.
- **Existing methods**:
  - Existing causal discovery algorithms fall mainly into two categories: constraint-based methods and score-based methods.
  - Constraint-based methods determine the dependencies between nodes through conditional independence tests, while score-based methods select the best graph by optimizing a score function.
  - Some hybrid methods first use a constraint-based method to narrow the search space and then use a score-based method for local optimization.
- **Divide-and-conquer methods**:
  - Existing divide-and-conquer methods divide the variable set into subsets, perform causal discovery on each subset separately, and finally merge the results.
  - These methods usually come with no theoretical guarantees, and merging the learned graphs can be computationally expensive.

### Innovations of the paper

- **Causal partitioning**:
  - A superstructure is used to divide the variable set into overlapping subsets, so that causal discovery can run independently on each subset.
  - This avoids an additional learning step to merge the subsets and improves computational efficiency.
  - The paper proves that, under certain assumptions, learning with a causal partition consistently estimates the Markov Equivalence Class of the true causal graph.

### Experimental results

- **Influence of the number of samples**:
  - As the number of samples increases, learning with a causal partition gradually converges to the Markov Equivalence Class of the true causal graph.
  - Good performance is also observed when the partition is combined with algorithms that do not support latent variables, such as PC, GES, and NOTEARS.
- **Influence of superstructure density**:
  - Even if the superstructure contains a large number of redundant edges, learning with a causal partition can still approach the Markov Equivalence Class of the true causal graph.
- **Imperfect superstructure**:
  - When the PC algorithm is used to estimate the superstructure, the causal partitioning method still performs well even if the resulting superstructure is imperfect.

In conclusion, this paper proposes a new causal partitioning method that performs causal discovery efficiently on high-dimensional problems and comes with theoretical guarantees. This provides a new tool for processing large-scale biological networks and other high-dimensional data sets.
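
The divide-and-conquer idea summarized above can be illustrated with a small Python sketch. It is not the paper's algorithm: the overlap rule (expanding each disjoint block with its superstructure neighbours), the plain edge-union merge, and the `oracle_learner` stand-in for a real causal discovery routine are all assumptions made for illustration, as are the function names and the toy graph.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)


def undirected_adjacency(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a collection of node pairs."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def overlapping_partition(
    superstructure: Dict[str, Set[str]], blocks: List[Set[str]]
) -> List[Set[str]]:
    """Expand each disjoint block with its neighbours in the superstructure.

    Assumed overlap rule (hypothetical, for illustration only): a subset is a
    block plus every node adjacent to the block in the superstructure.
    """
    subsets = []
    for block in blocks:
        expanded = set(block)
        for node in block:
            expanded |= superstructure.get(node, set())
        subsets.append(expanded)
    return subsets


def learn_and_merge(
    subsets: List[Set[str]], local_learner: Callable[[Set[str]], Set[Edge]]
) -> Set[Edge]:
    """Run the local learner on every subset and take the union of the edges."""
    merged: Set[Edge] = set()
    for subset in subsets:
        merged |= local_learner(subset)
    return merged


# Toy ground truth: a chain a -> b -> c -> d -> e -> f.
true_dag: Set[Edge] = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")}

# Superstructure: the true skeleton plus one redundant candidate edge a - c.
superstructure = undirected_adjacency(true_dag | {("a", "c")})

# A disjoint starting partition of the variables (chosen by hand here).
blocks = [{"a", "b", "c"}, {"d", "e", "f"}]
subsets = overlapping_partition(superstructure, blocks)


def oracle_learner(subset: Set[str]) -> Set[Edge]:
    """Stand-in for a real learner: returns the true edges inside the subset."""
    return {(u, v) for (u, v) in true_dag if u in subset and v in subset}


merged = learn_and_merge(subsets, oracle_learner)
print(sorted(map(sorted, subsets)))
print(sorted(merged))
```

Because every true edge is also a superstructure edge, each edge has both endpoints in at least one expanded subset, so the union of the local results covers the whole graph in this toy setting. A real learner run on a subset does not behave like the oracle, since variables outside the subset act as unobserved confounders; handling that case is where the paper's assumptions and guarantees come in.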