Short Text Model Based on Strong Feature Thesaurus

Wentao Lu, Yongfeng Huang, Xing Li, Zhuo Zhang, Yingkun Li
DOI: https://doi.org/10.2991/isrme-15.2015.126
2015-01-01
Abstract: Data sparseness, the defining characteristic of short text, is caused by the diversity of language expression and the short length of the text. Previous text models, represented by Bag of Words (BOW), consider only the statistical features of words and therefore tend to underperform on short texts. To tackle this problem, we introduce a new text model that combines the statistical method with semantic estimation. Specifically, we obtain a "Strong Feature Thesaurus" through a mining process based on the Latent Dirichlet Allocation (LDA) model, and then incorporate the semantic information into BOW by weighting those strong feature terms. To assess the performance of this model, we conduct two short-text clustering experiments. The results show that our model outperforms prevailing text models such as BOW.
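The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes the Strong Feature Thesaurus has already been mined (the LDA mining step is not shown), and the function name `weighted_bow`, the `boost` factor, and the example terms are hypothetical choices for illustration; the abstract does not specify the actual weighting scheme.

```python
from collections import Counter


def weighted_bow(tokens, strong_terms, boost=2.0):
    """Return bag-of-words counts with strong-feature terms scaled by `boost`.

    `strong_terms` stands in for the Strong Feature Thesaurus; `boost` is an
    assumed constant multiplier, not a value taken from the paper.
    """
    counts = Counter(tokens)
    return {
        term: count * (boost if term in strong_terms else 1.0)
        for term, count in counts.items()
    }


# Hypothetical thesaurus; in the paper it is mined with LDA.
strong_terms = {"goal", "match"}
doc = "late goal wins the match".split()
vec = weighted_bow(doc, strong_terms)
```

Terms found in the thesaurus receive a larger weight than ordinary terms, so two short texts that share a strong feature term end up closer together under a similarity measure computed on these vectors.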