Uyghur text clustering based on semantic word set

Shengwei Tian, Xianmin Zhai, Long Yu, Hanjun Guo
2013-01-01
Journal of Computational Information Systems
Abstract: TF-IDF vector space representations suffer from high dimensionality, sparse information, and disregard for the semantic relations between words. To address these problems, this paper proposes a method that uses semantic word sets as features to reduce dimensionality and increase information density. The study applies the latent semantic analysis algorithm to obtain the semantic relations between words and builds a semantic dictionary with ESD. The word sets then serve as features to represent each text, and combining this representation with a clustering algorithm yields TCSD, which is used to cluster the corpus. Experimental results show a precision of 94.29% and a recall of 94.28%, indicating that TCSD outperforms algorithms that use individual words as features. Copyright © 2013 Binary Information Press.
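The idea of replacing word features with semantic word sets can be illustrated with a minimal sketch. It is not the paper's ESD or TCSD procedure, whose details the abstract does not give. It assumes NumPy, a toy English corpus, a truncated SVD as the latent semantic analysis step, and a hypothetical greedy grouping rule with a cosine threshold of 0.9; all function names are illustrative. The sketch stops at the word-set feature matrix, which could then be passed to any clustering algorithm.

```python
import numpy as np

def term_document_matrix(docs):
    """Return (vocab, matrix) with one row per word and one column per document."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc:
            matrix[index[w], j] += 1.0
    return vocab, matrix

def lsa_word_vectors(matrix, k):
    """Project words into a k-dimensional latent space with a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def build_word_sets(vocab, vectors, threshold):
    """Greedy grouping (an assumption, not the paper's ESD): a word joins the
    first set whose seed word it resembles, otherwise it seeds a new set."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    word_to_set, seeds = {}, []
    for i, w in enumerate(vocab):
        for set_id, seed in enumerate(seeds):
            if unit[i] @ unit[seed] >= threshold:
                word_to_set[w] = set_id
                break
        else:
            seeds.append(i)
            word_to_set[w] = len(seeds) - 1
    return word_to_set, len(seeds)

def word_set_features(docs, word_to_set, n_sets):
    """Count, for each document, how many of its words fall in each word set."""
    feats = np.zeros((len(docs), n_sets))
    for j, doc in enumerate(docs):
        for w in doc:
            feats[j, word_to_set[w]] += 1.0
    return feats

# Toy corpus: "cat" and "dog" never co-occur but share the context word "pet".
docs = [["cat", "pet"], ["dog", "pet"], ["stock", "market"], ["trade", "market"]]
vocab, matrix = term_document_matrix(docs)
vectors = lsa_word_vectors(matrix, k=2)
word_to_set, n_sets = build_word_sets(vocab, vectors, threshold=0.9)
feats = word_set_features(docs, word_to_set, n_sets)
```

In this toy case the six-word vocabulary collapses into two word sets, so the first two documents, which share only one word in the original space, receive identical feature vectors. This is the dimension reduction and density gain the abstract describes, shown on invented data only.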