A Distributed Load Balance Algorithm of MapReduce for Data Quality Detection

Yitong Gao, Yan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao
DOI: https://doi.org/10.1007/978-3-319-32055-7_24
2016-01-01
Abstract: Quality detection on big data is an important problem in the data quality field. MapReduce is a widely used distributed processing model for big data, and load balance is a key factor in its performance. In this paper, we propose a distributed greedy approximation algorithm for the load balance problem in MapReduce for data quality detection. There are three key challenges: (a) show that the problem is NP-complete and prove a meaningful approximation ratio for the proposed algorithm, (b) add only one more round of MapReduce than conventional processing and take up minimal time in the overall process, and (c) be simple and practical to implement. Experimental results on real-life and synthetic data demonstrate that the proposed algorithm is effective for load balance.
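The abstract does not spell out the algorithm itself, so the following is only a minimal single-machine sketch of the general idea of greedy load balancing, not the authors' distributed method. It uses the classical longest-processing-time-first (LPT) heuristic, which is known to achieve a 4/3 - 1/(3m) approximation ratio for makespan on m identical machines. The function name `greedy_assign`, the `key_loads` input (an estimated record count per intermediate key), and the example values are all hypothetical and assumed for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def greedy_assign(key_loads: Dict[str, int], num_reducers: int) -> Tuple[List[List[str]], List[int]]:
    """Assign keys to reducers with the classical LPT greedy heuristic.

    Illustrative assumption: key_loads maps each intermediate key to an
    estimated load. This is NOT the algorithm from the paper.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")

    # Min-heap of (current total load, reducer index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_reducers)]

    # Visit keys from heaviest to lightest; ties are broken by key name
    # so that the result is deterministic.
    for key, load in sorted(key_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, r = heapq.heappop(heap)  # least-loaded reducer so far
        assignment[r].append(key)
        heapq.heappush(heap, (total + load, r))

    # Collect the final load of each reducer from the heap entries.
    loads = [0] * num_reducers
    for total, r in heap:
        loads[r] = total
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical key sizes, for illustration only.
    assignment, loads = greedy_assign({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, 2)
    print(assignment, loads)
```

In a MapReduce setting, the per-key load estimates would plausibly come from an extra counting pass, which is consistent with the abstract's "one more round of MapReduce", though that mapping is an assumption here rather than something the abstract states.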