Topical Paragraph Vector learning

Qinlong Wang, Ruifang Liu, Hongqiao Li, Wenbin Guo
DOI: https://doi.org/10.1109/ICNC.2015.7377987
2015-01-01
Abstract: Word embeddings are distributed representations of word features. Despite their effectiveness, most word embedding models share a common problem: each word is represented by a single vector, which fails to capture homonymy and polysemy. In this paper, we propose the Topical Paragraph Vector (TPV), whose training method is similar to that of word embeddings. We also use the ordering and semantics of words as features during training. In addition, we employ a latent topic model to assign specific topics to each word given the context of the documents. With the proposed TPV model, we implicitly obtain multiple embeddings for each word in the latent space, and thus overcome the weakness of a single word embedding to a certain extent. Furthermore, our model combines the word embeddings within a document into a vector for a more semantically enriched document-level representation. Our experiments show that it outperforms the baseline model on a text classification task on the 20_Newsgroup corpus.
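The idea of giving each word one vector per assigned topic, then combining the vectors in a document, can be illustrated with a toy sketch. This is not the authors' training procedure: the vectors are random rather than learned, the topic assignments are hard-coded rather than produced by a latent topic model, and averaging is only an assumed way to combine them. The names `make_embeddings` and `document_vector` are hypothetical.

```python
import random

DIM = 4  # illustrative embedding size, not taken from the paper


def make_embeddings(pairs, dim=DIM, seed=0):
    """Give every (word, topic) pair its own random vector.

    In a trained model these vectors would be learned; random values
    are used here only to show the lookup structure.
    """
    rng = random.Random(seed)
    # Sort so the assignment is reproducible regardless of set ordering.
    return {
        pair: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for pair in sorted(pairs)
    }


def document_vector(tagged_doc, embeddings):
    """Average the (word, topic) vectors of a topic-tagged document.

    Averaging is an assumption made for this sketch.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for pair in tagged_doc:
        total = [t + v for t, v in zip(total, embeddings[pair])]
    return [t / len(tagged_doc) for t in total]


# Hard-coded topic ids stand in for the output of a latent topic model.
doc_finance = [("bank", 0), ("loan", 0), ("interest", 0)]
doc_river = [("bank", 1), ("river", 1), ("water", 1)]

embeddings = make_embeddings(set(doc_finance) | set(doc_river))

# "bank" now has two distinct vectors, one per topic.
bank_finance = embeddings[("bank", 0)]
bank_river = embeddings[("bank", 1)]

vec_finance = document_vector(doc_finance, embeddings)
vec_river = document_vector(doc_river, embeddings)
```

Keying the lookup table on (word, topic) pairs is what lets a polysemous word such as "bank" receive different vectors in different contexts, while the document vector remains a single fixed-size representation that a classifier could consume.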