A feature selection method based on synonym merging in text classification system

Haipeng Yao, Chong Liu, Peiying Zhang, Luyao Wang
DOI: https://doi.org/10.1186/s13638-017-0950-z
2017-01-01
Abstract: As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract the feature vectors that represent texts, a step that is very important for text classification. In this paper, a feature selection algorithm based on synonym merging, named SM-CHI, is proposed. It uses an improved CHI formula together with synonym merging to select feature words, so that classification accuracy can be improved and the feature dimension can be reduced. In addition, for the feature words selected by SM-CHI, this paper presents three weight calculation algorithms to explore the best method for updating feature weights. Finally, we designed three comparative experiments and showed that classification accuracy is highest when the improved CHI formula 2 is chosen, the threshold α is set to 0.8, and the largest weight among the synonyms is used to update the feature weight.
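The abstract's pipeline (score terms, merge synonyms above a similarity threshold, keep the largest weight) can be illustrated with a minimal Python sketch. It rests on several assumptions: the abstract does not give the improved CHI formulas, so the textbook chi-square statistic stands in for them; the function names, the greedy merge order, and the toy similarity table are all hypothetical; only the threshold of 0.8 and the largest-weight update rule come from the abstract.

```python
from typing import Callable, Dict


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square score of a term for a class (NOT the paper's improved formula).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def merge_synonyms(
    weights: Dict[str, float],
    similarity: Callable[[str, str], float],
    alpha: float = 0.8,
) -> Dict[str, float]:
    """Greedily merge terms whose similarity to a kept term is at least alpha.

    Terms are visited from highest to lowest weight, so each kept term
    already carries the largest weight of its synonym group.
    """
    merged: Dict[str, float] = {}
    for term in sorted(weights, key=lambda t: (-weights[t], t)):
        for kept in merged:
            if similarity(term, kept) >= alpha:
                # Largest-weight update rule: keep the bigger of the two weights.
                merged[kept] = max(merged[kept], weights[term])
                break
        else:
            merged[term] = weights[term]
    return merged


# Hypothetical similarity table, for illustration only.
_TOY_SIMILARITY = {frozenset(("car", "automobile")): 0.9}


def toy_similarity(x: str, y: str) -> float:
    """Look up a symmetric similarity score; unknown pairs score 0.0."""
    if x == y:
        return 1.0
    return _TOY_SIMILARITY.get(frozenset((x, y)), 0.0)


toy_weights = {"car": 3.0, "automobile": 5.0, "banana": 1.0}
toy_merged = merge_synonyms(toy_weights, toy_similarity, alpha=0.8)
```

With the toy data, "car" and "automobile" collapse into one feature that keeps the weight 5.0, while "banana" stays separate, so the feature dimension drops from three to two. A real system would replace the lookup table with a thesaurus or embedding-based similarity measure.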