Performance Comparison of Clustering Algorithms in Spark

Mo HAI,You ZHANG
2017-01-01
Computer Science
Abstract:The performance of three typical clustering algorithms which are K-means,Bisecting K-means and Gaussian Mixture in Spark,were compared by the experiments from runtime,speedup,scalability and size up.The results show that when the scale of the dataset is hundreds of megabytes,as the number of nodes increases,the runtime of the three algorithms decreases more obviously.When the scale of the dataset is larger than 500MB,the speedup of the three algorithms increases more obviously,and the speedup increases linearly with the increase of the number of nodes.The scalability of the three algorithms decreases with the increase of the number of nodes.When the scale of the dataset is larger than 500MB,the scalability of the Bisecting K-means algorithm is the lowest compared to that of the K-means and Gaussian Mixture algorithm.When the scale of the dataset is larger than 100MB,the sizeup of the Gaussian Mixture algorithm is much larger than that of K-means algorithm and bisecting K-mean algorithm.
What problem does this paper attempt to address?