Semantic-based Intelligent Data Clean Framework for Big Data

Jia Wang, Zhijun Song, Qian Li, Jun Yu, Fei Chen
DOI: https://doi.org/10.1109/spac.2014.6982731
2014-01-01
Abstract: To overcome the limitations of existing data cleansing methods when applied to massive data, we propose a generic semantic-based framework that uses a parallelized processing model for effective big data cleansing. We also use an improved Semantic-Based Keyword Matching Algorithm to handle duplicate data. Experimental results show that the parallelized framework with the improved Semantic-Based Keyword Matching Algorithm identifies duplicates with high recall and precision and performs well for big data cleansing.
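The abstract does not spell out the matching algorithm, so the following is only a rough illustration of the general idea of semantic keyword matching for duplicate detection, not the authors' method. It assumes a hypothetical synonym table (SYNONYMS) in place of a real ontology, uses Jaccard overlap of canonicalized keyword sets as the similarity measure, and uses a thread pool as a stand-in for the parallelized processing model; all function names and the 0.8 threshold are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical synonym table mapping surface keywords to a canonical concept.
# A real system would draw this from an ontology or thesaurus.
SYNONYMS = {"corp": "company", "inc": "company", "co": "company",
            "st": "street", "rd": "road"}


def keywords(record: str) -> frozenset:
    """Lowercase, strip punctuation, and map each token to its canonical concept."""
    cleaned = "".join(c if c.isalnum() else " " for c in record.lower())
    return frozenset(SYNONYMS.get(tok, tok) for tok in cleaned.split())


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap of two keyword sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(records, threshold=0.8, workers=4):
    """Return sorted index pairs (i, j), i < j, whose similarity meets the threshold."""
    # "Map" step: extract keyword sets for all records concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keys = list(pool.map(keywords, records))

    def score(pair):
        i, j = pair
        return pair if similarity(keys[i], keys[j]) >= threshold else None

    # Compare every pair of records concurrently and keep the matches.
    pairs = combinations(range(len(records)), 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(p for p in pool.map(score, pairs) if p is not None)
```

For example, find_duplicates(["Acme Corp, 12 Main St", "ACME Inc. 12 Main Street", "Globex Co, 5 Oak Rd"]) returns [(0, 1)], because the first two records reduce to the same keyword set once synonyms are canonicalized. The all-pairs comparison is quadratic in the number of records and Python threads do not run CPU-bound work in parallel, so a framework aimed at big data would need real parallel workers and some way of partitioning records before comparison; the sketch omits both.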