A Text Clustering Algorithm Using an Online Clustering Scheme for Initialization

Jianhua Yin,Jianyong Wang
DOI: https://doi.org/10.1145/2939672.2939841
2016-01-01
Abstract:In this paper, we propose a text clustering algorithm using an online clustering scheme for initialization called FGSDMM+. FGSDMM+ assumes that there are at most K-max clusters in the corpus, and regards these K-max potential clusters as one large potential cluster at the beginning. During initialization, FGSDMM+ processes the documents one by one in an online clustering scheme. The first document will choose the potential cluster, and FGSDMM+ will create a new cluster to store this document. Later documents will choose one of the non-empty clusters or the potential cluster with probabilities derived from the Dirichlet multinomial mixture model. Each time a document chooses the potential cluster, FGSDMM+ will create a new cluster to store that document and decrease the probability of later documents choosing the potential cluster. After initialization, FGSDMM+ will run a collapsed Gibbs sampling algorithm several times to obtain the final clustering result. Our extensive experimental study shows that FGSDMM+ can achieve better performance than three other clustering methods on both short and long text datasets.
What problem does this paper attempt to address?