Hierarchical Clustering on HDP Topics to Build a Semantic Tree from Text.

Jianfeng Si, Qing Li, Tieyun Qian, Xiaotie Deng
2012-01-01
Abstract: An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, in which topics at different levels of the tree exhibit different levels of semantic abstraction (i.e., the deeper a topic resides, the more specific it is). Instead of learning every node directly, which is quite time-consuming, our approach builds on a nonparametric Bayesian topic model, namely Hierarchical Dirichlet Processes (HDP). By tuning the Dirichlet scale parameter of the topics, two topic sets at different levels of abstraction are learned separately from the HDP and then integrated into a hierarchical clustering process. We term our approach HDP Clustering (HDP-C). During hierarchical clustering, the lower-level specific topics are clustered into higher-level, more general topics in an agglomerative fashion to obtain the final topic tree. Evaluation of tree quality on several real-world datasets demonstrates its competitive performance.
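To make the agglomerative step concrete, the following is a minimal sketch of bottom-up clustering over topic-word distributions. It is not the HDP-C procedure itself: the abstract does not specify the distance measure or merge rule, so the Jensen-Shannon divergence, the size-weighted averaging of merged distributions, and the names `js_divergence`, `build_topic_tree`, and `toy_topics` are all assumptions made for illustration. The toy distributions stand in for topics that would, in practice, come from a fitted HDP.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def build_topic_tree(topics):
    """Merge topic-word distributions bottom-up into a binary tree.

    Assumption (not from the paper): the closest pair under JS divergence
    is merged, and the parent distribution is the average of its children
    weighted by the number of original topics each child covers.
    """
    nodes = [
        {"dist": list(t), "children": [], "leaves": [i]}
        for i, t in enumerate(topics)
    ]
    while len(nodes) > 1:
        # Find the pair of current nodes with the smallest divergence.
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda ij: js_divergence(nodes[ij[0]]["dist"], nodes[ij[1]]["dist"]),
        )
        a, b = nodes[i], nodes[j]
        wa, wb = len(a["leaves"]), len(b["leaves"])
        merged = {
            "dist": [(wa * x + wb * y) / (wa + wb) for x, y in zip(a["dist"], b["dist"])],
            "children": [a, b],
            "leaves": sorted(a["leaves"] + b["leaves"]),
        }
        # Replace the two merged nodes with their new parent.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


# Hypothetical specific topics over a 4-word vocabulary (made-up numbers).
toy_topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.30],
    [0.05, 0.05, 0.30, 0.60],
]
root = build_topic_tree(toy_topics)
for child in root["children"]:
    print(child["leaves"])
```

In this toy setup the two topics concentrated on the first two words end up under one parent and the other two under another, so the root's children play the role of more general topics. A real pipeline would replace `toy_topics` with the specific-topic set learned from the HDP and would likely use the separately learned general-topic set to guide the upper levels, as the abstract describes.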