An Improved Parallel K-means Clustering Algorithm with MapReduce

Qing Liao, Fan Yang, Jingming Zhao
DOI: https://doi.org/10.1109/icct.2013.6820477
2013-01-01
Abstract: The K-means algorithm is one of the best-known clustering algorithms and has been applied to a wide variety of problems. However, its processing performance usually becomes a bottleneck when it is used on massive data. Since MapReduce, as the most popular parallel framework for cloud computing, handles massive data effectively, K-means clustering based on MapReduce has become a focus of research. In this paper, an improved parallel K-means clustering algorithm based on MapReduce is proposed, which improves on the performance of traditional ones by decreasing the number of iterations and speeding up the processing of each iteration. First, the authors put forward an approach to choosing the distance measure by comparing the Euclidean distance and the Manhattan distance. Then, the authors give a method for selecting initial centroids that are consistent with the distribution of the data. According to simulation, the improved parallel K-means algorithm based on MapReduce achieves higher processing speed and stability than the traditional ones.
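To make the two distance measures and the map/reduce split concrete, the following is a minimal single-process sketch in Python. It is not the authors' implementation: the function names (`map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical, the centroid update uses the plain coordinate mean for both distance measures as a simplifying assumption, and the paper's initial-centroid selection method is not reproduced because the abstract does not describe it.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    # Straight-line distance: sqrt of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))


def map_phase(points, centroids, distance):
    # "Map" step: emit (index of nearest centroid, point) for every point.
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
        yield idx, p


def reduce_phase(pairs, centroids):
    # "Reduce" step: group points by centroid index and average each group.
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points stays unchanged
    for idx, members in groups.items():
        dim = len(members[0])
        new_centroids[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centroids


def kmeans_iteration(points, centroids, distance=euclidean):
    # One K-means iteration expressed as a map step followed by a reduce step.
    return reduce_phase(map_phase(points, centroids, distance), centroids)
```

In an actual MapReduce job the map step would run in parallel over data splits and the reduce step would run once per centroid key; here both run sequentially so that the data flow is easy to follow. Passing `distance=manhattan` switches the assignment rule without changing the rest of the iteration.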