Abstract: This dissertation addresses learning Bayesian networks (BNs) from distributed heterogeneous databases. Such settings call for distributed techniques that reduce communication overhead, scale better, and minimize the transmission of possibly sensitive data. The objective is to learn a collective BN from heterogeneously distributed data held at geographically diverse sites. The collective BN must be close to the BN learned by a centralized method (denoted Bcntr) while requiring only a small amount of data transmission among sites. In general, the collective learning algorithms have four steps: local learning, sample selection, cross learning, and combination. The proposed methods rest on two key ideas: (1) exploiting the decomposability property of BNs; and (2) identifying the samples that are most likely to be evidence of cross terms. We show that the low-likelihood samples at each site are the ones most likely to be evidence of cross terms. One collective structure learning method and two collective parameter learning methods are proposed. For structure learning, the collective method finds the correct structure over local variables, provided the base structure learning algorithm has the decomposability property, although some extra links may be introduced because of the hidden variable problem. Sample selection chooses low-likelihood samples at the local sites and transmits them to a central site. In cross learning, the structure of the cross variables and the cross set are identified. In combination, all cross links are added and the extra local links are removed. For parameter learning, Collective Method 1 (CM1) and Collective Method 2 (CM2) learn a BN that is close to Bcntr using a small portion of the samples. Local learning estimates the parameters of local variables, cross learning estimates the parameters of cross variables, and the combination step aggregates the two sets of parameters.
To handle applications with real-time constraints, we developed CM2. Using the notion of a cross set, CM2 uses only a subset of the features at a local site for likelihood computation and data selection, which greatly reduces both local computation and data transmission overhead. Experimental results demonstrate the efficiency and accuracy of these methods.
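
The sample selection step described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: it assumes discrete variables, a local structure given as a parent map, Laplace-smoothed frequency estimates for the local parameters, and a fixed fraction of samples to transmit. The names `fit_cpts`, `log_likelihood`, and `select_low_likelihood` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def fit_cpts(samples, parents, alpha=1.0):
    """Estimate smoothed conditional probabilities from local samples.

    samples: list of dicts mapping variable name -> discrete value.
    parents: dict mapping variable name -> tuple of parent names (assumed local structure).
    alpha:   Laplace smoothing constant (an assumption of this sketch).
    """
    values = {v: sorted({s[v] for s in samples}) for v in parents}
    joint, marginal = Counter(), Counter()
    for s in samples:
        for v, pa in parents.items():
            ctx = tuple(s[p] for p in pa)
            joint[(v, ctx, s[v])] += 1
            marginal[(v, ctx)] += 1

    def prob(v, ctx, x):
        # P(v = x | parents = ctx) with additive smoothing.
        return (joint[(v, ctx, x)] + alpha) / (marginal[(v, ctx)] + alpha * len(values[v]))

    return prob


def log_likelihood(sample, parents, prob):
    """Log-likelihood of one sample under the local BN (sum over local factors)."""
    return sum(
        math.log(prob(v, tuple(sample[p] for p in pa), sample[v]))
        for v, pa in parents.items()
    )


def select_low_likelihood(samples, parents, fraction=0.1):
    """Return the indices of the lowest-likelihood samples at this site."""
    prob = fit_cpts(samples, parents)
    order = sorted(
        range(len(samples)),
        key=lambda i: log_likelihood(samples[i], parents, prob),
    )
    k = max(1, int(len(samples) * fraction))
    return sorted(order[:k])


if __name__ == "__main__":
    # Toy local data: B usually copies A; two samples break the pattern.
    data = [{"A": i % 2, "B": i % 2} for i in range(20)]
    data[5]["B"] = 0
    data[12]["B"] = 1
    structure = {"A": (), "B": ("A",)}
    print(select_low_likelihood(data, structure, fraction=0.1))
```

In this toy data the two samples that break the local pattern receive the lowest likelihood under the local model, so they are the ones a site would transmit to the central site for cross learning. The selection fraction here is arbitrary; the abstract does not state how the threshold is chosen.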

Collective Approach for Bayesian Network Learning from Distributed Heterogeneous Database