Topic Discovery from Heterogeneous Texts

Jipeng Qiang, Ping Chen, Wei Ding, Tong Wang, Fei Xie, Xindong Wu
DOI: https://doi.org/10.1109/ictai.2016.0039
2016-01-01
Abstract: Recently, many topic models such as Latent Dirichlet Allocation (LDA) have made important progress towards generating high-level knowledge from a large corpus. They assume that a text consists of a mixture of topics, which is usually the case for regular articles but may not hold for a short text, which usually contains only one topic. In practice, a corpus may include both short texts and long texts; in this case, neither methods developed only for long texts nor methods developed only for short texts can generate satisfactory results. In this paper, we present an innovative method to discover latent topics from a heterogeneous corpus including both long and short texts. A new topic model based on the collapsed Gibbs sampling algorithm is developed for modeling such heterogeneous texts. Experiments on real-world datasets validate the effectiveness of the proposed model in comparison with other state-of-the-art models.