Parallel Topic Model and Its Application on Document Clustering.

Lidong Wang, Yuhuai Wang, Shihua Cao, Yun Zhang, Kang An
DOI: https://doi.org/10.1504/ijict.2017.10008317
2017-01-01
Abstract: This paper presents PLDACOL, our parallel implementation of the LDACOL model, for effectively clustering large-scale document collections. Since a phrase carries more semantic information than the sum of its individual words, we use the LDACOL topic model for phrase discovery and Gibbs sampling for parameter inference. PLDACOL overcomes the high computational cost of parameter inference by using a distributed computing framework based on Hadoop. We show that PLDACOL can be applied to clustering large-scale document collections of different sizes and yields significant improvements in both effectiveness and efficiency compared with related traditional algorithms.