Tackling topic general words in topic modeling.

Yueshen Xu, Yuyu Yin, Jianwei Yin
DOI: https://doi.org/10.1016/j.engappai.2017.04.009
2017-01-01
Abstract: Topic models are a prevailing tool for exploring latent topics in documents and for supporting many NLP tasks. To obtain good topics for a corpus, a preprocessing step is often needed to remove common stop words and to identify topic general words (TGWs) in the corpus. Such words can seriously harm topic formation because they create spurious co-occurrences of unrelated words. They are also likely to occupy the top positions of multiple topics and cause many unrelated words to be grouped under one topic, which results in inscrutable and similar topics. In practice, one typically identifies and removes a list of TGWs from the corpus manually. This is a time-consuming process and very hard for a lay user. In this paper, we aim to solve this problem automatically. The proposed approaches can be based on the current corpus alone or on multiple corpora. In the latter case, a novel continuous learning method is proposed that learns from past results on multiple domain corpora to help identify TGWs in the current domain. We conduct experiments on two real-world datasets, and the experimental results show that the proposed approaches achieve superior results.

Highlights:
Study the problem of topic general words in topic modeling.
Propose a metric, the generality score, to measure the generality of a word.
Propose a new topic model, generality-sensitive LDA, to exploit generality scores in modeling.
Propose a continuous learning approach that can use multiple domains to find topic general words.
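
The abstract does not define the generality score, so the following is only a rough sketch of the kind of single-corpus preprocessing it describes, not the authors' method. It assumes a common baseline heuristic: a word that appears in a large fraction of documents is a candidate topic general word. The function names, the whitespace tokenizer, the toy corpus, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def document_frequency_ratio(docs):
    """Map each word to the fraction of documents that contain it."""
    if not docs:
        return {}
    df = Counter()
    for doc in docs:
        # Count each word at most once per document.
        df.update(set(doc.lower().split()))
    return {word: count / len(docs) for word, count in df.items()}


def candidate_general_words(docs, threshold=0.8):
    """Return words whose document-frequency ratio meets the threshold.

    This is a stand-in heuristic (an assumption), not the paper's
    generality score.
    """
    ratios = document_frequency_ratio(docs)
    return sorted(w for w, r in ratios.items() if r >= threshold)


# Toy corpus of product reviews (hypothetical data).
docs = [
    "the camera battery life is good",
    "good screen but the battery drains fast",
    "the lens is sharp and the price is good",
    "shipping was slow but the product is good",
]

print(candidate_general_words(docs))  # ['good', 'the']
```

In this toy corpus, "the" is an ordinary stop word, while "good" behaves like a topic general word: it is not on a typical stop-word list, yet it appears in every review regardless of what the review is about. A plain frequency cutoff cannot tell these two cases apart, and it would also miss general words that are frequent only within a domain.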