A Topic Model for Hierarchical Documents

Yang Yang, Feifei Wang, Fei Jiang, Shuyuan Jin, Jin Xu
DOI: https://doi.org/10.1109/DSC.2016.97
2016-01-01
Abstract: Uncovering the topics of short-text corpora has become increasingly important with the rapid growth of online communication. However, conventional topic mining methods may fail because each short text lacks context. Fortunately, a large proportion of online short texts co-occur with lengthy texts, such as reviews with product descriptions and comments with news articles. These two kinds of texts are hierarchically organized, and the hidden topical relationships between them can be used to enhance topic learning on both sides. In this paper, we therefore propose a topic model for (h)ierarchical (d)ocuments, referred to as hdLDA, to capture the hierarchical structure of these texts. Specifically, in hdLDA each short text has a probability distribution over two topics: one from a set of topics underlying the lengthy texts and the other from a topic set formed only by short texts. Under this assumption, the topics of short texts and lengthy documents in hdLDA are learned in a mutually reinforcing way. Extensive experiments on a dataset of news articles and user comments demonstrate that our approach discovers more prominent and comprehensive topics for both short texts and lengthy documents than baseline and state-of-the-art methods.
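The two-topic assumption for short texts can be illustrated with a small generative sketch. This is only one possible reading of the abstract, not the paper's specification: the function name `sample_short_text`, the toy topics, the uniform choice of the short-text topic, and the Beta(1, 1) prior on the mixing weight are all assumptions made for illustration.

```python
import random


def sample_short_text(parent_topic_weights, long_topics, short_topics, n_words, rng):
    """Generate one short text under an assumed two-topic reading of hdLDA."""
    # One topic from the set underlying lengthy texts, drawn from the
    # parent document's topic weights (assumption).
    z_long = rng.choices(range(len(long_topics)), weights=parent_topic_weights)[0]
    # One topic from the short-text-only set, drawn uniformly (assumption).
    z_short = rng.randrange(len(short_topics))
    # Mixing weight between the two chosen topics; Beta(1, 1) is an assumed prior.
    pi = rng.betavariate(1.0, 1.0)

    words = []
    for _ in range(n_words):
        # Each word comes from the long-text topic with probability pi,
        # otherwise from the short-text topic.
        topic = long_topics[z_long] if rng.random() < pi else short_topics[z_short]
        vocab, probs = zip(*topic.items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return z_long, z_short, pi, words


# Toy word distributions (hypothetical data, not from the paper).
long_topics = [
    {"election": 0.6, "policy": 0.4},
    {"match": 0.5, "league": 0.5},
]
short_topics = [
    {"agree": 0.7, "lol": 0.3},
    {"wrong": 0.5, "fake": 0.5},
]

rng = random.Random(0)
z_long, z_short, pi, words = sample_short_text(
    parent_topic_weights=[0.8, 0.2],
    long_topics=long_topics,
    short_topics=short_topics,
    n_words=8,
    rng=rng,
)
print(z_long, z_short, round(pi, 3), words)
```

In this sketch a comment inherits part of its vocabulary from a topic of the news article it is attached to and part from a comment-only topic, which is one way the two sides could inform each other during learning.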