Cleaning Uncertain Data with Crowdsourcing - a General Model with Diverse Accuracy Rates
Chen Zhang, Haodi Zhang, Weiteng Xie, Nan Liu, Qifan Li, Di Jiang, Peiguang Lin, Kaishun Wu, Lei Chen
DOI: https://doi.org/10.1109/tkde.2020.3027545
IF: 9.235
2021-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Since inaccuracies commonly exist in many applications, data uncertainty has become an important problem in database systems. To deal with data uncertainty, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with confidence. However, the results of a query or mining process may not be reliable when the uncertainty propagates through the system. In this paper, we leverage the power of crowdsourcing by designing a set of Human Intelligence Tasks, or HITs for short, to ask a crowd to improve the quality of uncertain data. In particular, we consider crowds consisting of workers with diverse accuracy rates when answering the HITs. We design solutions to maximize the data quality with a minimal number of HITs. Two obstacles make this optimization non-trivial and lead to a very high computational cost for selecting the optimal set of HITs. First, members of a crowd may return incorrect answers with different probabilities. Second, the HITs decomposed from uncertain data are often correlated. We address these challenges by designing an effective approximation algorithm and an efficient heuristic solution, especially for crowds with diverse individual accuracy rates. To further improve efficiency, we derive tight lower and upper bounds for effective filtering and estimation. Extensive experiments on both a simulated crowd and a real crowdsourcing platform are conducted to evaluate our solutions.
computer science, information systems, artificial intelligence, engineering, electrical & electronic