Accurate kNN Chinese Text Classification via Multiple Strategies
Xiulan Hao, Chenghong Zhang, Xiaopeng Tao, Shuyun Wang, Yunfa Hu
DOI: https://doi.org/10.1109/FSKD.2007.132
2007-01-01
Abstract: Text classification is one means of understanding text content. It is widely used in information retrieval, spam filtering, monitoring of harmful rumors, and blocking pornographic and malicious messages. kNN is widely used in text categorization, but it suffers when the training data set is biased. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training data set is biased, a traditional kNN classifier assigns almost all test documents of some rare (smaller) categories to common (larger) ones. In this case, classification performance cannot satisfy the user's requirements. To alleviate this problem, we adopt two measures to boost the kNN classifier. First, we optimize the feature set by removing some candidate features. Second, we modify the traditional decision rules by integrating the number of training samples of each category into them. Extensive experiments show that the adapted kNN achieves a significant improvement in classification performance on biased corpora.
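The abstract does not state the modified decision rule itself, so the following is only a minimal sketch of one plausible size-aware rule, not the authors' method. It assumes documents are sparse term-weight dictionaries compared by cosine similarity, and it divides each category's similarity-weighted vote by that category's training count. The names `cosine`, `knn_classify`, and `size_adjusted`, along with the toy data, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=5, size_adjusted=True):
    """Classify `query` from `training`, a list of (vector, label) pairs.

    With size_adjusted=False this is similarity-weighted kNN voting.
    With size_adjusted=True each category's vote is divided by its
    number of training samples (an assumed normalization, chosen here
    for illustration only).
    """
    class_sizes = Counter(label for _, label in training)

    # Score every training document once, then keep the k most similar.
    scored = [(cosine(query, vec), label) for vec, label in training]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

    # Accumulate similarity-weighted votes per category.
    votes = defaultdict(float)
    for similarity, label in neighbours:
        votes[label] += similarity

    if size_adjusted:
        for label in votes:
            votes[label] /= class_sizes[label]

    return max(votes, key=votes.get)


# Toy biased corpus: six "common" documents and one "rare" document.
training = [({"news": 1.0, "sport": 1.0}, "common") for _ in range(6)]
training.append(({"spam": 1.0, "news": 1.0}, "rare"))

# A query that matches the rare document exactly.
query = {"spam": 1.0, "news": 1.0}

plain_label = knn_classify(query, training, k=5, size_adjusted=False)
adjusted_label = knn_classify(query, training, k=5, size_adjusted=True)
```

On this toy corpus the four moderately similar common neighbours outvote the single exact rare match under the plain rule, while the size-adjusted rule favours the rare category. Dividing by category size is only one way to fold training counts into the vote; a weaker correction, such as dividing by a root of the category size, would follow the same pattern.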