Cleaning Missing Data Based on the Bayesian Network.

Liang Duan, Kun Yue, Wenhua Qian, Weiyi Liu
DOI: https://doi.org/10.1007/978-3-642-39527-7_34
2013-01-01
Abstract: To guarantee data quality, it is necessary to clean the missing data that are prevalent in real-world databases. Many existing data cleaning methods derive the correct value for each missing data item by incorporating additional information, such as functional dependencies or integrity constraints. In this paper, we propose a method for cleaning missing data items without additional information by adopting the Bayesian network (BN) as the framework for representing and inferring probability distributions. First, we learn a Bayesian network, called IBN, from the complete part of the given incomplete database. Then, we infer the probability distribution of each missing data item by Gibbs sampling over the IBN. Consequently, we obtain all possible values together with their corresponding probabilities (i.e., confidence degrees), by which we clean the incomplete database. Experimental results show the efficiency, accuracy, and precision of our method. © 2013 Springer-Verlag.
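The two steps named in the abstract (learn a network from the complete rows, then run Gibbs sampling to get a distribution over each missing item) can be illustrated with a minimal sketch. This is not the authors' implementation: the network structure `PARENTS` is fixed by hand rather than learned, the variables are binary, the parameters are estimated with Laplace smoothing, and all names and the toy data are hypothetical.

```python
import random
from collections import Counter

# Assumed (hand-fixed) structure A -> B -> C; the paper learns the network instead.
PARENTS = {"A": [], "B": ["A"], "C": ["B"]}
VALUES = [0, 1]  # assumption: every variable is binary


def learn_cpts(rows, alpha=1.0):
    """Estimate P(var | parents) from complete rows, with Laplace smoothing."""
    counts = {var: Counter() for var in PARENTS}
    for row in rows:
        for var, pars in PARENTS.items():
            counts[var][(tuple(row[p] for p in pars), row[var])] += 1

    def prob(var, value, row):
        pa = tuple(row[p] for p in PARENTS[var])
        num = counts[var][(pa, value)] + alpha
        den = sum(counts[var][(pa, x)] for x in VALUES) + alpha * len(VALUES)
        return num / den

    return prob


def joint(prob, row):
    """Joint probability of a fully assigned row under the network."""
    p = 1.0
    for var in PARENTS:
        p *= prob(var, row[var], row)
    return p


def gibbs_impute(prob, row, n_samples=2000, burn_in=200, seed=0):
    """Return an estimated distribution for every missing (None) item in row."""
    rng = random.Random(seed)
    missing = [var for var in PARENTS if row[var] is None]
    state = dict(row)
    for var in missing:
        state[var] = rng.choice(VALUES)  # random initial assignment
    tallies = {var: Counter() for var in missing}
    for step in range(burn_in + n_samples):
        for var in missing:
            # The full conditional of var is proportional to the joint.
            weights = []
            for x in VALUES:
                state[var] = x
                weights.append(joint(prob, state))
            state[var] = rng.choices(VALUES, weights=weights)[0]
        if step >= burn_in:
            for var in missing:
                tallies[var][state[var]] += 1
    return {var: {x: tallies[var][x] / n_samples for x in VALUES} for var in missing}


# Toy data: B usually copies A, and C copies B.
complete = (
    [{"A": 1, "B": 1, "C": 1}] * 8
    + [{"A": 0, "B": 0, "C": 0}] * 8
    + [{"A": 1, "B": 0, "C": 0}] * 2
    + [{"A": 0, "B": 1, "C": 1}] * 2
)
prob = learn_cpts(complete)
dist = gibbs_impute(prob, {"A": 1, "B": None, "C": 1})
best = max(dist["B"], key=dist["B"].get)  # most probable repair for B
```

Here `dist["B"]` plays the role of the confidence degrees mentioned in the abstract: each candidate value for the missing item is paired with an estimated probability, and `best` is the value a cleaner might write back. How the paper actually selects the final value from these distributions is not stated in the abstract.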