A Joint Topical N-Gram Language Model Based on LDA

Xiaojun Lin, Dan Li, Xihong Wu
DOI: https://doi.org/10.1109/IWISA.2010.5473439
2010-01-01
Abstract: In this paper, we propose a novel joint topical n-gram language model that combines semantic topic information with local constraints during training. Instead of training the n-gram language model and the topic model independently, we directly estimate the joint probability of the latent semantic topic and the n-gram. In this procedure, latent Dirichlet allocation (LDA) is employed to compute latent topic distributions for sentence instances. Our model not only captures long-range dependencies but also distinguishes the probability distribution of each n-gram across topics without introducing data sparseness. Experiments show that our model lowers perplexity significantly and is robust to the number of topics and the scale of the training data.
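The abstract does not give the estimation formulas, so the following is only a minimal sketch of one common way to combine topic distributions with n-gram counts, not the authors' method. It assumes bigrams, per-sentence topic distributions that are already available (in the paper they would come from LDA; here they are hard-coded), fractional counts weighted by topic probability, and add-alpha smoothing. The names `train_topical_bigrams`, `word_probability`, and `alpha` are hypothetical.

```python
from collections import defaultdict


def train_topical_bigrams(sentences, topic_weights, num_topics):
    """Accumulate fractional bigram counts per topic.

    sentences: list of token lists.
    topic_weights: one topic distribution per sentence (assumed to be
    produced by a separately trained LDA model; not computed here).
    """
    bigram = [defaultdict(float) for _ in range(num_topics)]
    context = [defaultdict(float) for _ in range(num_topics)]
    vocab = set()
    for tokens, weights in zip(sentences, topic_weights):
        padded = ["<s>"] + tokens
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            for k in range(num_topics):
                # Each bigram contributes to topic k in proportion to
                # the sentence's probability of belonging to topic k.
                bigram[k][(prev, word)] += weights[k]
                context[k][prev] += weights[k]
    return bigram, context, vocab


def word_probability(word, prev, weights, model, alpha=0.1):
    """Mixture over topics of add-alpha smoothed bigram probabilities."""
    bigram, context, vocab = model
    vocab_size = len(vocab)
    total = 0.0
    for k, weight in enumerate(weights):
        count = bigram[k].get((prev, word), 0.0)
        denom = context[k].get(prev, 0.0) + alpha * vocab_size
        total += weight * (count + alpha) / denom
    return total


# Toy usage: topic 0 is roughly "finance", topic 1 roughly "sports".
sentences = [["stocks", "fell"], ["team", "won"], ["stocks", "rose"]]
topic_weights = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
model = train_topical_bigrams(sentences, topic_weights, num_topics=2)
p_finance = word_probability("fell", "stocks", [1.0, 0.0], model)
p_sports = word_probability("fell", "stocks", [0.0, 1.0], model)
```

In this toy setup the same bigram receives a different probability under each topic, while every sentence still contributes counts to every topic, which is one way to limit the sparseness that hard per-topic partitioning of the data would cause.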