A Probabilistic Topic Model with Noise Reduction Ability

Jing LI, Yongbin QIN, Ruizhang HUANG
DOI: https://doi.org/10.3969/j.issn.1672-9722.2017.02.032
2017-01-01
Abstract: With the arrival of the big data era, recognizing and analyzing the hidden structure of text data efficiently has become increasingly important, and powerful computational tools are needed to help understand text data better. Probabilistic topic models, especially the Latent Dirichlet Allocation (LDA) model, have been proposed and widely applied in machine learning and text mining. However, the LDA model has a poor ability to distinguish similar topics, which harms its practical performance. To solve this problem, a new topic model named Noise Reduction Latent Dirichlet Allocation (NRLDA) is proposed on the basis of LDA. Many words are noise words that make no contribution to discriminating similar topics. NRLDA takes this into account by introducing new variables that distinguish the generative process of noise words from that of non-noise words, which LDA cannot do. In addition, a Gibbs sampler is developed to infer NRLDA's parameters, which is critical to investigating the structure of a text corpus. Experimental results show that the NRLDA model has a much stronger ability to differentiate similar topics, indicating that the idea behind the model is sound.
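The abstract does not spell out NRLDA's generative process, so the following is only an illustrative sketch of the general idea of separating noise words from topic words, not the authors' model. It assumes a per-word binary switch with a fixed probability `noise_prob` and a single background distribution `noise_word` shared by all documents. These names, the function `generate_corpus`, and all hyperparameter values are hypothetical.

```python
import numpy as np


def generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=30,
                    alpha=0.5, beta=0.5, noise_prob=0.3, seed=0):
    """Sample a toy corpus from an LDA-like process with a noise switch.

    Assumption (not from the paper): each word is noise with fixed
    probability `noise_prob`; noise words come from one shared background
    distribution, all other words follow the usual LDA process.
    """
    rng = np.random.default_rng(seed)
    # One word distribution per topic, as in standard LDA.
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # Hypothetical background distribution for noise words.
    noise_word = rng.dirichlet(np.full(vocab_size, beta))

    docs = []
    for _ in range(n_docs):
        # Per-document topic proportions.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        words, is_noise = [], []
        for _ in range(doc_len):
            flag = bool(rng.random() < noise_prob)
            if flag:
                # Noise word: ignores the document's topics entirely.
                w = rng.choice(vocab_size, p=noise_word)
            else:
                # Topic word: pick a topic, then a word from that topic.
                z = rng.choice(n_topics, p=theta)
                w = rng.choice(vocab_size, p=topic_word[z])
            words.append(int(w))
            is_noise.append(flag)
        docs.append((words, is_noise))
    return docs
```

Calling `generate_corpus()` returns a list of `(words, is_noise)` pairs, one per document. In an inference setting the `is_noise` flags would be latent and would have to be sampled alongside the topic assignments, for example by a Gibbs sampler as the abstract describes.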