A Model-Based Approach for Text Clustering with Outlier Detection

Jianhua Yin,Jianyong Wang
DOI: https://doi.org/10.1109/icde.2016.7498276
2016-01-01
Abstract:Text clustering is a challenging problem due to the high-dimensional and large-volume characteristics of text datasets. In this paper, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model for text clustering (abbr. to GSDPMM) which does not need to specify the number of clusters in advance and can cope with the high-dimensional problem of text clustering. Our extensive experimental study shows that GSDPMM can achieve significantly better performance than three other clustering methods and can achieve high consistency on both long and short text datasets. We found that GSDPMM has low time and space complexity and can scale well with huge text datasets. We also propose some novel and effective methods to detect the outliers in the dataset and obtain the representative words of each cluster.
What problem does this paper attempt to address?