Incorporating Wikipedia Concepts and Categories As Prior Knowledge into Topic Models.
Kang Xu, Guilin Qi, Junheng Huang, Tianxing Wu
DOI: https://doi.org/10.3233/ida-160021
IF: 1.7
2017-01-01
Intelligent Data Analysis
Abstract: Topic models have been widely applied to discover the topics that underlie a collection of documents. Incorporating human knowledge can guide conventional topic models toward topics that are easier to interpret and more semantically coherent. Several knowledge-based topic models have been proposed, but they leverage only lexical knowledge about words, which often does not accord with the topics. To address this problem, we recognize entity mentions, in addition to words, in the documents and incorporate entity knowledge from external knowledge bases. In this paper, we study how to use entity knowledge, namely concepts and categories in Wikipedia, as prior knowledge in topic models to discover more coherent topics. We propose a novel knowledge-based topic model, WCM-LDA (Wikipedia-Category-concept-Mention Latent Dirichlet Allocation), which not only models the relationship between words and topics but also uses the concept and category knowledge of entities to model the semantic relation between entities and topics. We compare WCM-LDA with state-of-the-art knowledge-based topic models on three datasets. Experimental results show that our approach outperforms the existing baseline methods on all three datasets. Moreover, our model can visualize topics with their top words, concepts, and categories, so that topics are easy to interpret and classify.