An Effective and Cost-Based Framework for a Qualitative Hybrid Data Deduplication

Charles R. Haruna, MengShu Hou, Moses J. Eghan, Michael Y. Kpiebaareh, Lawrence Tandoh
DOI: https://doi.org/10.1007/978-981-13-6861-5_44
2019-01-01
Abstract: In the real world, an entity may occur several times in a database. These duplicates may have varying keys and/or include errors, which makes deduplication a difficult task. Deduplication cannot be solved accurately using machine-based or crowdsourcing techniques alone. Crowdsourcing was used to address the shortcomings of machine-based approaches. Compared to machines, the crowd provided relatively accurate results, but with slow execution time and at high cost. A hybrid technique for data deduplication using Euclidean distance and a chromatic correlation clustering algorithm was presented. The technique aimed at reducing the crowdsourcing cost, reducing the time the crowd spends on deduplication, and providing higher accuracy in data deduplication. In the experiments, the proposed algorithm was compared with some existing techniques and outperformed some of them, offering high deduplication accuracy and efficiency while incurring low crowdsourcing cost.