Research on Hadoop-based Massive short text clustering algorithm

Qiang Zhao, Yuliang Shi, Zepeng Qing
DOI: https://doi.org/10.1117/12.2540380
2019-01-01
Abstract: Many clustering algorithms work well on small data sets of fewer than 200 data objects. However, a large database may contain millions of objects, and clustering on such a large data set may lead to biased results. As data volumes and availability continue to grow, so does the need for large-dataset analytics. Among the most commonly used clustering algorithms, k-means has proved to be one of the most popular choices, providing acceptable results in a reasonable amount of time. In this paper, we present an improved k-means algorithm with better initial centroids, and we implement this modified algorithm on the Hadoop platform. Experiments show that the improved k-means algorithm converges faster than classic k-means and has a lower average execution time.
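The idea in the abstract, k-means with a more careful choice of initial centroids, run as map and reduce steps, can be illustrated with a minimal single-machine sketch. The abstract does not say how the paper selects its initial centroids, so k-means++-style seeding is swapped in here purely as a stand-in. The function names (`seed_centroids`, `map_assign`, `reduce_mean`, `kmeans`) are hypothetical, and the map/reduce structure only mimics in plain Python what a Hadoop job would distribute across nodes.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centroids(points, k, rng):
    """k-means++-style seeding (an assumption: the abstract does not
    describe the paper's own initialization method)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # All points coincide with existing centroids; fall back to uniform choice.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    idx = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
    return idx, (point, 1)


def reduce_mean(values):
    """Reduce step: average all points that share one centroid index."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for vec, n in values:
        for d in range(dim):
            total[d] += vec[d]
        count += n
    return tuple(t / count for t in total)


def kmeans(points, k, max_iter=100, seed=0):
    """Iterate map and reduce until the centroids stop changing.

    Returns (centroids, number of iterations performed)."""
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for iteration in range(1, max_iter + 1):
        groups = defaultdict(list)
        for p in points:  # on Hadoop, this loop would be the mappers
            key, value = map_assign(p, centroids)
            groups[key].append(value)
        # On Hadoop, one reducer call per key; empty clusters keep their centroid.
        new_centroids = [
            reduce_mean(groups[i]) if groups[i] else centroids[i]
            for i in range(k)
        ]
        if new_centroids == centroids:
            return centroids, iteration
        centroids = new_centroids
    return centroids, max_iter


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    final_centroids, iterations = kmeans(data, k=2)
    print(final_centroids, iterations)
```

Each pass of the loop corresponds to one MapReduce job, so starting from better-placed centroids means fewer passes over the data, which is the kind of saving in convergence and execution time that the abstract reports.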