A Hybrid Machine-Crowdsourcing Approach For Web Table Matching And Cleaning

Chunhua Li, Pengpeng Zhao, Victor S. Sheng, Zhixu Li, Guanfeng Liu, Jian Wu, Zhiming Cui
DOI: https://doi.org/10.1007/978-3-319-39958-4_11
2016-01-01
Abstract: Table matching and data cleaning are two crucial activities in integrating data from different web tables, and they have traditionally been treated as separate activities. We show that data cleaning can effectively help discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach that handles the two activities together with the help of a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourced validation and infer the others by consolidating crowdsourcing results with machine-generated results. When resolving inconsistencies between data and semantics, relative trust is taken into account when validating data or semantics via the crowd. Our experimental results on real-life datasets show the effectiveness of the proposed approach for matching and cleaning web tables.