Parallel implementation of K-Means clustering algorithm based on MapReduce computing model of Hadoop

Hongbo Xu, Nianmin Yao, Qilong Han, Haiwei Pan
2015-01-01
Abstract: In recent years, data clustering has been studied extensively, and many methods and theories have been developed. However, with the growth of databases and the popularity of the Internet, developments such as Big Data and Cloud Computing pose new challenges for clustering research. This paper presents a parallel k-means clustering algorithm based on the MapReduce computing model of the Hadoop platform. The MapReduce computing model has two phases: a map phase and a reduce phase. The map phase calculates the distance between each point and each cluster center and assigns each point to its nearest cluster. All points that belong to the same cluster are sent to a single reduce task. The reduce phase calculates the new cluster centers for the next MapReduce job. Experiments on datasets of different sizes demonstrate that the proposed algorithm performs well in terms of speedup, scaleup, and sizeup. It is therefore well suited to clustering huge datasets.
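The map and reduce phases described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation and does not use the Hadoop API: it only simulates one MapReduce job per k-means iteration in plain Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `kmeans`), the use of Euclidean distance, the convergence tolerance, and the choice to keep the old center when a cluster receives no points are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (assumed metric), Python 3.8+


def map_phase(points, centers):
    """Map: emit (cluster_index, point) for the center nearest to each point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(members):
    """Reduce: the new center is the coordinate-wise mean of one cluster's points."""
    n = len(members)
    dims = len(members[0])
    return tuple(sum(m[d] for m in members) / n for d in range(dims))


def kmeans(points, centers, max_iters=20, tol=1e-9):
    """Run one simulated MapReduce job per iteration until the centers stop moving."""
    for _ in range(max_iters):
        groups = shuffle(map_phase(points, centers))
        # Assumption: a cluster that receives no points keeps its previous center.
        new_centers = [
            reduce_phase(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))
        ]
        if all(dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            return new_centers
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

On a real cluster the map calls would run in parallel over input splits, and the framework would route each cluster key to one reduce call. The sketch keeps only the data flow between the two phases.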