The Optimization of the Big Data Cleaning Based on Task Merging

Dong-Hua YANG, Ning-Ning LI, Hong-Zhi WANG, Jian-Zhong LI, Hong GAO
DOI: https://doi.org/10.11897/SP.J.1016.2016.00097
2016-01-01
Abstract: Data quality problems can severely harm big data applications, so big data with quality problems must be cleaned. The MapReduce programming framework can exploit parallelism to achieve high scalability for large-scale data cleaning. However, for lack of effective design, MapReduce-based cleaning processes contain redundant computation, which degrades performance. This paper therefore aims to optimize the parallel data cleaning process to improve its efficiency. We found that some data cleaning tasks often run on the same input file or use the same computation results. Based on this observation, the paper presents a new optimization technique based on task merging. By merging redundant computations, and by combining several simple computations over the same input file, the number of MapReduce rounds is reduced, which shortens the running time and ultimately optimizes the system. Several complex modules of the data cleaning process are optimized in this way: the entity recognition module, the inconsistent data repair module, and the missing value filling module. Experimental results show that the proposed strategy effectively improves the efficiency of data cleaning.
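The task-merging idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes an in-memory stand-in for a MapReduce round, and the names `run_mapreduce`, `merged_mapper`, `split_by_task`, and the sample records are hypothetical. Two simple cleaning tasks that read the same input are combined into one round by tagging each emitted key with its task name.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce round in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def missing_mapper(record):
    """Task A (hypothetical): emit one count per missing field."""
    for column, value in record.items():
        if value is None:
            yield column, 1


def city_mapper(record):
    """Task B (hypothetical): emit one count per observed city value."""
    if record["city"] is not None:
        yield record["city"], 1


def sum_reducer(key, values):
    """Both tasks aggregate by summing their counts."""
    return sum(values)


def merged_mapper(record):
    """Merged task: one pass over the input serves both tasks.

    Keys are tagged with the task name so the outputs stay separable.
    """
    for key, value in missing_mapper(record):
        yield ("missing", key), value
    for key, value in city_mapper(record):
        yield ("city", key), value


def split_by_task(merged):
    """Split the merged output back into one result dict per task."""
    results = defaultdict(dict)
    for (task, key), value in merged.items():
        results[task][key] = value
    return dict(results)


# Illustrative sample input; None marks a missing value.
records = [
    {"name": "Ann", "city": "Harbin"},
    {"name": None, "city": "Harbin"},
    {"name": "Bob", "city": None},
    {"name": "Cy", "city": "Beijing"},
]

# Unmerged: two rounds, each scanning the same input.
separate_missing = run_mapreduce(records, missing_mapper, sum_reducer)
separate_city = run_mapreduce(records, city_mapper, sum_reducer)

# Merged: a single round produces both results.
by_task = split_by_task(run_mapreduce(records, merged_mapper, sum_reducer))
```

In this sketch the merged round scans the input once instead of twice while producing the same per-task results; on a real cluster the saving would come from avoiding a second job launch and a second read of the input file.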