A Probabilistic Ranking Framework for Web-Based Relational Data Imputation

Zhaoqiang Chen,Qun Chen,Jiajun Li,Zhanhuai Li,Lei Chen
DOI: https://doi.org/10.1016/j.ins.2016.03.036
IF: 8.1
2016-01-01
Information Sciences
Abstract:Due to richness of information on web, there is an increasing interest to search for missing attribute values in relational data on web. Web-based relational data imputation has to first extract multiple candidate values from web and then rank them by their matching probabilities. However, effective candidate ranking remains challenging because web documents are unstructured and popular search engines can only provide with relevant but not necessarily semantically matching information.In this paper, we propose a novel probabilistic approach for ranking the web-retrieved candidate values. It can integrate various influence factors, e.g. snippet rank order, occurrence frequency, occurrence pattern, and keyword proximity, in a single framework by semantic reasoning. The proposed framework consists of snippet influence model and semantic matching model. The snippet influence model measures the influence of a snippet, and the semantic matching model measures the semantic similarity between a candidate value in a snippet and a missing relational value in a tuple. We also present effective probabilistic estimation solutions for both models. Finally, we empirically evaluate the performance of the proposed framework on real datasets. Our extensive experiments demonstrate that it outperforms the state-of-the-art techniques by considerable margins on imputation accuracy.
What problem does this paper attempt to address?