Abstract: In this dissertation we concentrate on learning Bayesian networks (BNs) from distributed heterogeneous databases. Such settings call for distributed techniques that reduce communication overhead, scale better, and require minimal communication of possibly sensitive data. The objective of this work is to learn a collective BN from heterogeneous data distributed among geographically diverse sites. The collective BN must be close to the BN learned by a centralized method (B_cntr) while requiring only a small amount of data transmission among sites. In general, the collective learning algorithms have four steps: local learning, sample selection, cross learning, and combination. The proposed methods rest on two key points: (1) exploiting the BN decomposability property, and (2) identifying the samples that are most likely to be evidence of cross terms. We show that low-likelihood samples at each site are the most likely to be such evidence. One collective structure learning method and two collective parameter learning methods are proposed. For structure learning, the collective method can find the correct structure of local variables provided the base structure learning algorithm has the decomposability property, although some extra links may be introduced by the hidden variable problem. Sample selection chooses low-likelihood samples at the local sites and transmits them to a central site. In cross learning, the structure of the cross variables and the cross set are identified. In combination, we add all cross links and remove the extra local links. For parameter learning, Collective Method 1 (CM1) and Collective Method 2 (CM2) can learn a BN that is close to B_cntr using a small portion of the samples. Local learning learns the parameters of local variables, cross learning learns the parameters of cross variables, and the combination step aggregates the two sets of parameters.
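The sample selection step can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes discrete variables, a local structure given as a parent map, parameters estimated by smoothed counting, and a fixed selection fraction. The names `fit_cpts`, `log_likelihoods`, `select_low_likelihood`, `alpha`, and `fraction` are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_cpts(data, parents, arity, alpha=1.0):
    """Estimate P(X_v | parents(v)) at one site by smoothed counting."""
    cpts = {}
    for v, pa in parents.items():
        shape = tuple(arity[p] for p in pa) + (arity[v],)
        counts = np.full(shape, alpha, dtype=float)  # Laplace smoothing (assumption)
        for row in data:
            counts[tuple(row[p] for p in pa) + (row[v],)] += 1
        cpts[v] = counts / counts.sum(axis=-1, keepdims=True)
    return cpts

def log_likelihoods(data, parents, cpts):
    """Per-sample log-likelihood under the local BN (sum of log CPT entries)."""
    ll = np.zeros(len(data))
    for v, pa in parents.items():
        idx = tuple(data[:, p] for p in pa) + (data[:, v],)
        ll += np.log(cpts[v][idx])
    return ll

def select_low_likelihood(data, parents, cpts, fraction=0.1):
    """Return indices of the lowest-likelihood samples (fixed fraction is an assumption)."""
    ll = log_likelihoods(data, parents, cpts)
    k = max(1, int(round(fraction * len(data))))
    return np.argsort(ll)[:k]

# Toy local site with two binary variables and local structure X0 -> X1.
# 95% of the rows agree (X1 == X0); 5% disagree and are poorly explained locally.
data = np.array([[0, 0]] * 475 + [[1, 1]] * 475 + [[0, 1]] * 25 + [[1, 0]] * 25)
parents = {0: (), 1: (0,)}
arity = {0: 2, 1: 2}

cpts = fit_cpts(data, parents, arity)
selected = select_low_likelihood(data, parents, cpts, fraction=0.02)
to_transmit = data[selected]  # samples that would be sent to the central site
```

In this toy data every selected row is one of the rare disagreeing samples, that is, one the local model explains poorly. Whether such samples really carry evidence of cross terms depends on the data, which is the claim the dissertation sets out to establish.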
To handle applications with real-time constraints, we have developed CM2. Using the notion of a cross set, CM2 chooses a subset of the features at a local site for the likelihood computation and data selection, which can greatly reduce both local computation and data transmission overhead. Experimental results demonstrate the efficiency and accuracy of these methods.
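The idea of restricting the likelihood computation to a cross set can be sketched as follows. The abstract does not say how the cross set is obtained or how the restricted likelihood is defined, so this sketch makes three assumptions: the cross set is given, the restricted score sums only the log-probability terms of cross-set features, and only the cross-set columns of the selected samples are transmitted. The function name `select_on_cross_set` and its parameters are hypothetical.

```python
import numpy as np

def select_on_cross_set(data, log_terms, cross_set, fraction=0.1):
    """Rank samples by a likelihood restricted to cross-set features.

    data      : (n_samples, n_features) local observations
    log_terms : (n_samples, n_features); entry [i, v] is
                log P(x_v | parents(v)) for sample i under the local BN
    cross_set : feature indices assumed to take part in cross terms
    """
    cols = sorted(cross_set)
    # Only the cross-set factors are summed, not all local factors.
    partial_ll = log_terms[:, cols].sum(axis=1)
    k = max(1, int(round(fraction * len(data))))
    chosen = np.argsort(partial_ll)[:k]
    # Assumption: only the cross-set columns of the chosen rows are sent.
    payload = data[np.ix_(chosen, cols)]
    return chosen, payload

# Hand-made per-feature probabilities for 6 samples and 4 local features.
probs = np.array([
    [0.01, 0.9, 0.9, 0.9],   # unlikely only because of feature 0
    [0.9,  0.9, 0.9, 0.9],
    [0.9,  0.8, 0.9, 0.9],
    [0.9,  0.9, 0.9, 0.7],
    [0.9,  0.1, 0.9, 0.2],   # unlikely on the cross-set features 1 and 3
    [0.9,  0.9, 0.1, 0.9],
])
log_terms = np.log(probs)
data = np.arange(24).reshape(6, 4)  # placeholder observations

chosen, payload = select_on_cross_set(data, log_terms, {1, 3}, fraction=0.2)
full_choice = int(np.argmin(log_terms.sum(axis=1)))  # full-likelihood pick, for contrast
```

Here the restricted score selects sample 4, whereas the full local likelihood would select sample 0. The restricted version sums two factors per sample instead of four, and the payload carries two columns instead of four.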
