A Research on Mapreduce-Based Redundancy Pruning for Record Linkage

Fengyu Yang, Ying Chen
DOI: https://doi.org/10.4028/www.scientific.net/amr.753-755.3009
2013-01-01
Abstract: To improve the efficiency of record linkage while keeping recall high, multiple-signature techniques, which assign an object to several clusters, have been applied in many domains. This leads to redundant comparisons for the same pair of source and target objects. Based on the MapReduce model, we propose a redundancy pruning approach that prunes redundant pairs before the final similarity computation. Our approach is implemented in two consecutive MapReduce phases and evaluated on two practical datasets, where it shows good pruning ability.
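The abstract does not give the algorithm itself, so the following is only a minimal in-memory sketch of the general idea, not the authors' method. It assumes a hypothetical signature function (`signatures`, the first three letters of each token) and simulates two consecutive MapReduce-style phases in plain Python: the first clusters records by signature and emits candidate pairs, the second keys on the pair so that each source-target pair is kept only once before any similarity computation. All names and sample records are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product


def signatures(name):
    # Hypothetical signature function (an assumption, not from the paper):
    # one signature per token, taken as its first three lowercase letters.
    return {tok[:3] for tok in name.lower().split()}


def phase1(sources, targets):
    # Map: emit (signature, record id) for every signature of every record.
    # Reduce: within each signature cluster, pair every source with every target.
    clusters = defaultdict(lambda: ([], []))
    for rid, name in sources.items():
        for sig in signatures(name):
            clusters[sig][0].append(rid)
    for rid, name in targets.items():
        for sig in signatures(name):
            clusters[sig][1].append(rid)
    pairs = []
    for src_ids, tgt_ids in clusters.values():
        pairs.extend(product(src_ids, tgt_ids))
    return pairs  # may contain the same pair several times


def phase2(pairs):
    # Map: key each candidate by the (source, target) pair itself.
    # Reduce: emit each key once, so redundant pairs are pruned.
    grouped = defaultdict(int)
    for pair in pairs:
        grouped[pair] += 1
    return sorted(grouped)


# Illustrative records (made up for this sketch).
sources = {"s1": "John Smith", "s2": "Mary Jones"}
targets = {"t1": "Jon Smith", "t2": "John Smithe"}

raw_pairs = phase1(sources, targets)
pruned_pairs = phase2(raw_pairs)
print(len(raw_pairs), len(pruned_pairs))
```

In this toy input the pair (s1, t2) shares two signatures and is therefore generated twice in the first phase; the second phase keeps a single copy, so only distinct pairs would reach the similarity computation.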