K-means Document Clustering Based on Latent Dirichlet Allocation

Peng Guan
Abstract: K-means is a popular document clustering algorithm that is fast and efficient. Its disadvantages are that the number of clusters must be set in advance and that the initial cluster centers are selected randomly. Latent Dirichlet Allocation (LDA) is a mature probabilistic topic model that aids in document dimensionality reduction, semantic mining, and information retrieval. We present a document clustering method based on LDA and K-means (LDA_K-means). To improve the clustering quality of K-means, we derive the initial cluster centers from the typical latent topics extracted by LDA. The effectiveness of LDA_K-means is evaluated on the 20 Newsgroups data set. We show that LDA_K-means significantly improves clustering quality compared with clustering that combines LDA with randomly initialized K-means (LDA_KMR).
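The idea of seeding K-means from LDA topics can be illustrated with a minimal sketch. The abstract does not say how "typical" topics are selected or how centers are built from them, so everything below is an assumption made for illustration: documents are represented by their document-topic distributions (here a hand-written toy matrix `theta`, standing in for real LDA output), a topic counts as typical if it is the dominant topic of many documents, and each initial center is the mean vector of the documents dominated by that topic. The function names `lda_seeded_centers` and `kmeans` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def lda_seeded_centers(theta, k):
    """Build up to k initial centers from document-topic vectors.

    theta: one topic distribution per document (assumed to come from LDA).
    Assumption: a topic is 'typical' if it dominates many documents.
    Returns fewer than k centers if fewer than k topics dominate any document.
    """
    # Dominant topic of each document = index of its largest probability.
    dominant = [max(range(len(row)), key=row.__getitem__) for row in theta]
    # The k topics that dominate the most documents.
    top_topics = [t for t, _ in Counter(dominant).most_common(k)]
    centers = []
    for t in top_topics:
        members = [row for row, d in zip(theta, dominant) if d == t]
        # Center = component-wise mean of the documents dominated by topic t.
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd's K-means starting from the given centers."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Update step: recompute each center; keep it if its cluster is empty.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers


# Toy document-topic matrix (5 documents, 3 topics); not real LDA output.
theta = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.1, 0.6],
]
labels, centers = kmeans(theta, lda_seeded_centers(theta, k=2))
```

Replacing `lda_seeded_centers` with centers drawn at random from `theta` would give a baseline in the spirit of LDA_KMR, again only under the assumptions stated above.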
Mathematics, Computer Science