An Efficient K-Means Clustering Algorithm on MapReduce

Qiuhong Li, Peng Wang, Wei Wang, Hao Hu, Zhongsheng Li, Junxian Li
DOI: https://doi.org/10.1007/978-3-319-05810-8_24
2014-01-01
Abstract: As an important approach to analyzing massive data sets, an efficient k-means implementation on MapReduce is crucial in many applications. In this paper we propose a series of strategies to improve the efficiency of k-means for massive high-dimensional data points on MapReduce. First, we use locality sensitive hashing (LSH) to map data points into buckets, based on which the original data points are converted into weighted representative points and outlier points. Then an effective center initialization algorithm is proposed, which yields higher-quality initial centers. Finally, a pruning strategy is proposed to speed up the iteration process by pruning unnecessary distance computations between centers and data points. An extensive empirical study shows that the proposed techniques greatly improve both the efficiency and accuracy of k-means on MapReduce.
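The LSH step described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not specify the hash family, the bucket parameters, or how representatives and outliers are defined, so everything concrete here is an assumption made for illustration: a random-projection hash of the form floor((a . x + b) / width), a representative taken as the bucket centroid weighted by bucket size, and a hypothetical `min_bucket_size` threshold below which a bucket's points are treated as outliers. It is not the authors' implementation and omits the MapReduce layer entirely.

```python
import math
import random
from collections import defaultdict


def lsh_key(point, projections, offsets, width):
    # One integer per hash function: floor((a . x + b) / width).
    # Nearby points tend to share the same tuple of integers.
    return tuple(
        math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / width)
        for a, b in zip(projections, offsets)
    )


def summarize(points, num_hashes=4, width=4.0, min_bucket_size=2, seed=0):
    """Bucket points with LSH, then reduce each bucket to a weighted point.

    All parameter names and defaults are hypothetical, chosen for illustration.
    Returns (representatives, outliers), where each representative is a
    (centroid, weight) pair and outliers are the original points.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumed hash family: Gaussian random projections with uniform offsets.
    projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                   for _ in range(num_hashes)]
    offsets = [rng.uniform(0.0, width) for _ in range(num_hashes)]

    buckets = defaultdict(list)
    for p in points:
        buckets[lsh_key(p, projections, offsets, width)].append(p)

    representatives = []
    outliers = []
    for members in buckets.values():
        if len(members) < min_bucket_size:
            # Assumption: sparsely populated buckets are kept as raw outliers.
            outliers.extend(members)
        else:
            centroid = [sum(coord) / len(members) for coord in zip(*members)]
            representatives.append((centroid, len(members)))
    return representatives, outliers


if __name__ == "__main__":
    data = [[0.0, 0.0]] * 5 + [[100.0, 100.0]]
    reps, outs = summarize(data)
    print(len(reps), "representatives,", len(outs), "outliers")
```

Under these assumptions, a subsequent k-means pass would run on the representatives (using the weights when averaging) plus the outliers rather than on every original point, which is where the reduction in distance computations would come from.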