Research on Efficient K_Means Parallel Algorithm Based on Hadoop Distributed Architecture

Lin Qian, Lin Wang, Zhu Mei, Jun Yu, Guangxin Zhu, Debing Song, Mingjie Xu
DOI: https://doi.org/10.1088/1757-899x/452/4/042066
2018-01-01
IOP Conference Series: Materials Science and Engineering
Abstract: To address the high time complexity, slow convergence, low clustering accuracy, and slow running speed of the K-means algorithm, an efficient parallel K-means algorithm based on Hadoop and the MapReduce framework is proposed. First, the algorithm uses a K selective sorting algorithm to improve sampling efficiency; second, the iteration centers are updated using a weight replacement policy; finally, the initial center points are obtained using a sample pretreatment strategy. Experimental results show that the proposed algorithm achieves good convergence, accuracy, and speedup, and improves the algorithm's overall performance.
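For orientation, the baseline that such a method builds on can be sketched as one MapReduce-style K-means iteration: a map step assigns each point to its nearest center, and a reduce step averages the points in each cluster. This is a minimal single-process sketch of plain K-means, not the authors' algorithm. It does not implement the K selective sorting, weight replacement, or sample pretreatment steps, and the function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centers):
    """Map: emit (index of nearest center, (point, 1)) for every point."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield idx, (p, 1)


def reduce_phase(pairs, centers):
    """Reduce: sum coordinates and counts per center, then take the mean."""
    sums = {}
    counts = defaultdict(int)
    for idx, (p, n) in pairs:
        if idx not in sums:
            sums[idx] = [0.0] * len(p)
        sums[idx] = [s + x for s, x in zip(sums[idx], p)]
        counts[idx] += n
    # A center that received no points keeps its previous position.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else tuple(centers[i])
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iter=20, tol=1e-9):
    """Repeat map + reduce until the centers stop moving (plain K-means)."""
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

On a real Hadoop cluster the map and reduce steps would run as distributed tasks over data splits, with each iteration typically submitted as a separate job; the sketch only mirrors that data flow in one process.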