An Improved K-means Algorithm for Document Clustering

Guohua Wu, Hairong Lin, Ershuai Fu, Liuyang Wang
DOI: https://doi.org/10.1109/CSMA.2015.20
2015-01-01
Abstract: The K-means algorithm performs poorly on high-dimensional, sparse data, because traditional distance measures cannot handle such data effectively. Motivated by this, this paper proposes a K-means algorithm based on SimHash. After the text is preprocessed, SimHash is applied to the extracted feature vectors to obtain a fingerprint for each text. SimHash not only reduces the dimensionality of the text representation, but also allows the Hamming distance between fingerprints to be used directly as the vector distance. Each document is assigned to a cluster according to this Hamming distance. Experimental results show that the algorithm preserves clustering quality while greatly improving the speed of K-means clustering.
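The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 64-bit fingerprint width, the MD5-based token hash, term-frequency weighting, the bitwise-majority center update, and the naive seeding from the first k fingerprints are all assumptions made here for illustration, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (assumption; the abstract does not state it)


def simhash(tokens):
    """Return a 64-bit SimHash fingerprint of a token list.

    Tokens are weighted by term frequency (an assumption; any feature
    weighting, such as TF-IDF, could be substituted).
    """
    totals = [0] * BITS
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    # Bit i of the fingerprint is 1 when the weighted vote is positive.
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def majority_center(fingerprints):
    """Bitwise majority vote over a cluster (assumed center update)."""
    half = len(fingerprints) / 2
    center = 0
    for i in range(BITS):
        ones = sum((f >> i) & 1 for f in fingerprints)
        if ones > half:
            center |= 1 << i
    return center


def kmeans_hamming(fingerprints, k, iterations=10):
    """K-means-style clustering of fingerprints under Hamming distance."""
    centers = list(fingerprints[:k])  # naive seeding (assumption)
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest center by Hamming distance.
        labels = [
            min(range(k), key=lambda c: hamming(f, centers[c]))
            for f in fingerprints
        ]
        # Update step: recompute each non-empty cluster's center.
        for c in range(k):
            members = [f for f, lab in zip(fingerprints, labels) if lab == c]
            if members:
                centers[c] = majority_center(members)
    return labels, centers
```

In this sketch each document is reduced to a single integer, so the distance computation in the assignment step is one XOR and a bit count rather than a sum over a high-dimensional sparse vector, which is the source of the speedup the abstract claims.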