VertiBayes: learning Bayesian network parameters from vertically partitioned data with missing values

Florian van Daalen, Lianne Ippel, Andre Dekker, Inigo Bermejo
DOI: https://doi.org/10.1007/s40747-024-01424-0
IF: 6.7
2024-04-25
Complex & Intelligent Systems
Abstract: Federated learning makes it possible to train a machine learning model on decentralized data. Bayesian networks are widely used probabilistic graphical models. While some research has been published on the federated learning of Bayesian networks, publications on Bayesian networks in a vertically partitioned data setting are limited, with important omissions, such as handling missing data. We propose a novel method called VertiBayes to train Bayesian networks (structure and parameters) on vertically partitioned data, which can handle missing values as well as an arbitrary number of parties. For structure learning we adapted the K2 algorithm with a privacy-preserving scalar product protocol. For parameter learning, we use a two-step approach: first, we learn an intermediate model using maximum likelihood, treating missing values as a special value, then we train a model on synthetic data generated by the intermediate model using the EM algorithm. The privacy guarantees of VertiBayes are equivalent to those provided by the privacy-preserving scalar product protocol used. We experimentally show VertiBayes produces models comparable to those learnt using traditional algorithms. Finally, we propose two alternative approaches to estimate the performance of the model using vertically partitioned data and we show in experiments that these give accurate estimates.
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper addresses the challenges of training a Bayesian network (structure and parameters) on vertically partitioned data, especially when the data contain missing values and are held by more than two parties. Specifically:

1. **Handling missing values**: In practical applications, especially in federated learning, different parties may follow different data collection protocols and quality standards, which leads to missing values. According to the paper, existing methods for vertically partitioned data omit this case. VertiBayes handles it with a two-step method: first, an intermediate model is trained by maximum likelihood estimation, treating missing values as a special value; then the final model is trained with the EM algorithm on synthetic data generated by the intermediate model.
2. **Supporting any number of parties**: Most existing federated learning methods for this setting handle only two parties, which limits the diversity of data sources. VertiBayes supports an arbitrary number of parties, so it can make fuller use of decentralized data.
3. **Privacy protection**: Structure learning and parameter learning on vertically partitioned data must not reveal the original records. VertiBayes relies on a privacy-preserving scalar product protocol for this, and its privacy guarantees are equivalent to those of the protocol used.

In summary, VertiBayes aims to handle missing values in vertically partitioned data, support multiple parties, and preserve privacy, while training a Bayesian network comparable to one learnt with traditional centralized algorithms.
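
The first step of the two-step method can be illustrated with a small Python sketch. It assumes that each party encodes "record has value v for my attribute" as a 0/1 indicator vector, so that a joint count across parties is the scalar product of those vectors. The names `MISSING`, `indicator`, `joint_count`, and `cpt_with_missing_state` are hypothetical, the toy data are made up, and the scalar product here is computed in the clear as a stand-in for the privacy-preserving protocol. The second step (EM on synthetic data) is not shown.

```python
from math import prod

MISSING = "?"  # hypothetical marker for a missing value

# Toy vertically partitioned data: same records, same order, one attribute per party.
party_a_x = ["a", "a", "b", MISSING, "b", "a"]   # parent attribute X
party_b_y = ["y", "n", "y", "y", MISSING, "y"]   # child attribute Y


def indicator(column, value):
    """0/1 vector marking the records where the column equals the given value."""
    return [1 if v == value else 0 for v in column]


def joint_count(*indicators):
    """Scalar product of any number of indicator vectors.

    Plain stand-in for a privacy-preserving n-party scalar product:
    it counts the records for which every party's indicator is 1.
    """
    return sum(prod(bits) for bits in zip(*indicators))


def cpt_with_missing_state(child, parent):
    """Maximum likelihood estimate of P(child | parent),
    treating MISSING as one more state of each attribute."""
    child_states = sorted(set(child))
    parent_states = sorted(set(parent))
    cpt = {}
    for p in parent_states:
        counts = {
            c: joint_count(indicator(child, c), indicator(parent, p))
            for c in child_states
        }
        total = sum(counts.values())  # > 0 because p was observed at least once
        cpt[p] = {c: counts[c] / total for c in child_states}
    return cpt


cpt = cpt_with_missing_state(party_b_y, party_a_x)
for parent_value, row in cpt.items():
    print(parent_value, row)
```

Because `joint_count` accepts any number of indicator vectors, the same counting idea extends to parent sets spread over more than two parties, which is the setting the summary describes.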