Parallel K-means Algorithm for Massive Texts on Spark

Peng LIU, Jiayu TENG, Enjie DING, Lei MENG
DOI: https://doi.org/10.3969/j.issn.1003-0077.2017.04.021
2017-01-01
Abstract: Due to the sharp increase in internet texts, the time k-means needs to process such data has grown dramatically. Some classic parallel architectures, such as Hadoop, have not improved the execution efficiency of k-means, because the frequent iterations in such algorithms are hard to handle efficiently. This paper proposes a parallel k-means algorithm based on Spark. It makes full use of Spark's in-memory RDD model to meet the frequent-iteration requirement of k-means. Experimental results show that k-means executes much more efficiently on Spark than on Hadoop on the same datasets and in the same computing environment.
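The iteration pattern the abstract refers to can be illustrated with a minimal sketch. This is not the authors' implementation: it is plain Python with no Spark dependency, and it assumes the texts have already been converted to numeric vectors by some method the abstract does not specify. The function names (closest, kmeans_step, kmeans) and the sample data are hypothetical. Each iteration is written as a map phase followed by a reduce-by-key phase; on Spark these would roughly correspond to map and reduceByKey over an RDD that is cached in memory, so the input is not reloaded from disk on every pass.

```python
def closest(point, centers):
    """Return the index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(points, centers):
    """One k-means iteration, written as a map phase then a reduce-by-key phase."""
    # Map: emit (cluster index, (vector, 1)) for every point.
    pairs = [(closest(p, centers), (p, 1)) for p in points]

    # Reduce by key: sum the vectors and the counts for each cluster index.
    sums = {}
    for idx, (vec, n) in pairs:
        if idx in sums:
            acc, cnt = sums[idx]
            sums[idx] = ([a + v for a, v in zip(acc, vec)], cnt + n)
        else:
            sums[idx] = (list(vec), n)

    # New center = mean of the assigned points; an empty cluster keeps its old center.
    return [
        [s / sums[i][1] for s in sums[i][0]] if i in sums else list(centers[i])
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iter=20, tol=1e-9):
    """Repeat kmeans_step until the centers stop moving or max_iter is reached."""
    for _ in range(max_iter):
        new_centers = kmeans_step(points, centers)
        shift = sum(
            (a - b) ** 2
            for old, new in zip(centers, new_centers)
            for a, b in zip(old, new)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers


# Hypothetical toy data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
```

Every call to kmeans_step reads the full point set again. That repeated read is the step the abstract describes as hard to handle efficiently on Hadoop, and it is the step that an in-memory dataset makes cheaper.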