Statistical Word Sense Aware Topic Models

Guoyu Tang, Yunqing Xia, Jun Sun, Min Zhang, Thomas Fang Zheng
DOI: https://doi.org/10.1007/s00500-014-1372-z
IF: 3.732
2014-01-01
Soft Computing
Abstract: Latent Dirichlet allocation (LDA) has proven effective at modeling the semantic relations between surface words. This semantic information in a document collection is useful for estimating the topic distribution of a document. In general, a surface word may contribute significantly to several topics in a document collection. LDA measures the contribution of a surface word to each topic and treats a surface word as identical across all documents. However, a surface word may present different signatures in different contexts; that is, polysemous words can be used with different senses in different contexts. Intuitively, disambiguating word senses for topic models can enhance their discriminative capabilities. In this work, we propose a joint model that induces document topics and word senses simultaneously. Instead of relying on predefined word sense resources, we capture word sense information via a latent variable and induce it directly from the corpora in a fully unsupervised manner. Experimental results show that the proposed joint model significantly outperforms the baselines in document clustering and also improves word sense induction over a stand-alone non-parametric model.
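The limitation the abstract describes, that a surface-word model treats a word as identical across all documents, can be illustrated with a minimal sketch. This is not the paper's joint model: the toy documents, the sense labels `bank#1` and `bank#2`, and all variable names are assumptions made for illustration, and the senses are assigned by hand here, whereas the proposed model would induce them as latent variables.

```python
from collections import Counter

# Two toy documents (hypothetical) that use "bank" in different senses.
docs = {
    "finance": "the bank approved the loan".split(),
    "river": "we walked along the river bank".split(),
}

# Surface-word view: one vocabulary entry per surface form,
# regardless of the context in which the word occurs.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs.values() for w in d}))}
bow = {name: Counter(vocab[w] for w in words) for name, words in docs.items()}

# Hand-assigned sense labels, for illustration only. A sense-aware
# topic model would infer these rather than receive them as input.
sense_of = {"finance": "bank#1", "river": "bank#2"}
sense_tokens = {
    name: [sense_of[name] if w == "bank" else w for w in words]
    for name, words in docs.items()
}

# In the surface view both documents share the same "bank" feature.
bank_id = vocab["bank"]
shared_surface = all(bank_id in counts for counts in bow.values())

# In the sense-aware view no "bank" feature is shared between them.
shared = set(sense_tokens["finance"]) & set(sense_tokens["river"])
shared_bank_senses = {t for t in shared if t.startswith("bank")}

print(shared_surface, shared_bank_senses)
```

In the surface view the two documents overlap on "bank" even though the senses differ; once the senses are separated, that spurious overlap disappears, which is the intuition behind the claim that sense disambiguation can sharpen a topic model's discriminative power.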