An Improved K-means Distributed Clustering Algorithm Based on Spark Parallel Computing Framework

Xin Lu, Huanghuang Lu, Jiao Yuan, Xun Wang
DOI: https://doi.org/10.1088/1742-6596/1616/1/012065
2020-01-01
Journal of Physics Conference Series
Abstract: The traditional distributed K-means clustering algorithm has several problems when clustering big data, including unstable results, poor clustering quality, and low execution efficiency. This paper proposes a density-based method for selecting initial cluster centers to improve the distributed K-means algorithm. The method combines three quantities, the sample density, the distance between clusters, and the cluster compact density, and defines their product as the difference weight density. The sample point with the maximum difference weight density is chosen as an initial cluster center, which addresses the randomness and low quality of initial center selection. The improved algorithm is implemented on the Spark parallel computing framework to further improve its performance on big data clustering. Experimental results show that the improved Spark-based distributed K-means algorithm achieves higher execution efficiency and accuracy, with good stability, in big data clustering analysis.
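The selection rule described in the abstract can be illustrated with a small single-machine sketch. The abstract does not give the exact formulas, so every definition below is an assumption made only for illustration: the cutoff radius is the mean pairwise distance, the sample density is the neighbor count within that radius, the cluster compact density is the inverse of the mean distance to those neighbors, and the distance between clusters is the distance to the nearest center already chosen. The function name `select_initial_centers` is hypothetical, and the sketch does not include the Spark parallelization.

```python
import math


def select_initial_centers(points, k):
    """Pick k initial centers by maximizing an assumed 'difference weight density'.

    All three factors below are illustrative assumptions, not the paper's formulas.
    """
    n = len(points)
    # Full pairwise Euclidean distance matrix (fine for a small sketch).
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    # Assumed cutoff radius: mean of all pairwise distances.
    pair_count = n * (n - 1) / 2
    radius = sum(d[i][j] for i in range(n) for j in range(i + 1, n)) / pair_count

    density = []  # assumed sample density: number of neighbors within the radius
    compact = []  # assumed compact density: 1 / mean distance to those neighbors
    for i in range(n):
        nbrs = [d[i][j] for j in range(n) if j != i and d[i][j] <= radius]
        density.append(len(nbrs))
        mean_nbr = sum(nbrs) / len(nbrs) if nbrs else 0.0
        compact.append(1.0 / mean_nbr if mean_nbr > 0 else 0.0)

    chosen = []
    for _ in range(k):
        best, best_weight = None, -1.0
        for i in range(n):
            if i in chosen:
                continue
            # Assumed between-cluster distance: distance to the nearest chosen
            # center (1.0 for the first pick, so only density terms matter).
            separation = min((d[i][c] for c in chosen), default=1.0)
            weight = density[i] * separation * compact[i]
            if weight > best_weight:
                best, best_weight = i, weight
        chosen.append(best)
    return [points[i] for i in chosen]


if __name__ == "__main__":
    blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
    blob_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5)]
    print(select_initial_centers(blob_a + blob_b, k=2))
```

Because the selection is deterministic, repeated runs on the same data return the same centers, which is the stability property the abstract attributes to replacing random initialization. The centers returned here would then seed an ordinary K-means iteration.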