Short Text Topic Modeling With Flexible Word Patterns

Xiaobao Wu, Chunping Li
DOI: https://doi.org/10.1109/IJCNN.2019.8852366
2019-01-01
Abstract: Since effective semantic representations are used in many practical applications, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Traditional topic models such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) perform poorly on short texts because of the data sparsity problem. The Biterm Topic Model (BTM), which models unordered word pairs (i.e., biterms) from the whole corpus, was proposed to solve this problem. However, both the performance and the efficiency of BTM are reduced by the many irrelevant and useless biterms it generates. In this paper, we propose a Multiterm Topic Model (MTM) for short text topic modeling. MTM extracts variable-length and more strongly correlated word patterns (i.e., multiterms) from the whole corpus. By directly modeling the generative process of multiterms, MTM can infer the word distribution of each topic and the topic distribution of each short text, alleviating the sparsity problem in short text modeling. With a proper number of flexible multiterms, the learning process of MTM is enhanced. Through extensive experiments on two real-world short text collections, we show that MTM is more efficient than the baseline models and outperforms them in terms of topic coherence and text classification.
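To make the notion of a biterm concrete, the following is a minimal sketch of biterm extraction as the abstract describes it: every unordered word pair within a short text, pooled over the whole corpus. The function names `extract_biterms` and `corpus_biterm_counts` and the toy documents are illustrative assumptions, not taken from the paper, and the sketch does not attempt to reproduce MTM's multiterm extraction, which the abstract does not specify.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one tokenized short text.

    Hypothetical helper for illustration; not the authors' code.
    """
    # Sort each pair so that (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Pool biterms from all documents into one corpus-level count."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus (assumed for illustration only).
docs = [
    ["apple", "iphone", "release"],
    ["apple", "release", "date"],
]
counts = corpus_biterm_counts(docs)
for biterm, n in counts.most_common():
    print(biterm, n)
```

A text with n tokens yields n(n-1)/2 biterms under this scheme, so the number of pairs grows quadratically with text length regardless of how related the two words are. This is one way to read the abstract's remark that many biterms are irrelevant.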