Feature Construction Approach for Email Categorization Based on Term Space Partition
Guyue Mi, Pengtao Zhang, Ying Tan
DOI: https://doi.org/10.1109/ijcnn.2013.6707020
2013-01-01
Abstract: This paper proposes a novel feature construction approach based on term space partition (TSP), which aims to establish a mechanism that lets terms play more sufficient and rational roles in email categorization. Dominant terms and general terms are separated by a vertical partition of the original term space with respect to feature selection metrics, while spam terms and ham terms are separated by a transverse partition with respect to class tendency. Strategies for constructing discriminative features, named term ratio and term density, are designed on the corresponding subspaces. The motivation and principle of the TSP approach are presented in detail, along with its implementation. Experiments are conducted on five benchmark corpora using cross-validation to evaluate the proposed TSP approach. Comprehensive experimental results suggest that the TSP approach far outperforms bag-of-words, the traditional and most widely used feature construction approach in spam filtering, in both performance and efficiency. Compared with the heuristic, state-of-the-art approaches CFC and LC, the proposed TSP approach shows a clear advantage in accuracy and F1 measure, and it also achieves high precision, which is highly desirable in real spam filtering. Furthermore, the TSP approach performs similarly to CFC in the efficiency of processing incoming emails, while being much faster than LC. In addition, the TSP approach is shown to work well with both unsupervised and supervised feature selection metrics, which makes it flexible in real-world applications.
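The two partitions and the two feature types described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas, so everything specific here is an assumption made for illustration: the function names `partition_term_space` and `construct_features` are hypothetical, document frequency stands in for the feature selection metric, the `dominant_fraction` cut-off is an invented parameter, class tendency is taken as P(t|spam) - P(t|ham), term ratio is taken as the fraction of a dominant subspace that appears in the email, and term density as the fraction of the email's distinct terms that fall in a general subspace.

```python
from collections import Counter


def partition_term_space(spam_docs, ham_docs, dominant_fraction=0.3):
    """Split the vocabulary into four subspaces (hypothetical helper).

    spam_docs / ham_docs are lists of tokenized emails (lists of terms).
    """
    # Document frequency of each term within each class.
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    vocab = set(spam_df) | set(ham_df)

    # Vertical partition: rank terms by a feature selection metric.
    # Assumption: overall document frequency (an unsupervised metric).
    ranked = sorted(vocab, key=lambda t: (-(spam_df[t] + ham_df[t]), t))
    cut = max(1, int(len(ranked) * dominant_fraction))
    dominant, general = set(ranked[:cut]), set(ranked[cut:])

    # Transverse partition: class tendency, assumed P(t|spam) - P(t|ham).
    def tendency(term):
        return spam_df[term] / len(spam_docs) - ham_df[term] / len(ham_docs)

    # Assumption: terms with zero tendency are grouped with ham terms.
    spam_terms = {t for t in vocab if tendency(t) > 0}
    ham_terms = vocab - spam_terms

    return {
        "spam_dominant": dominant & spam_terms,
        "ham_dominant": dominant & ham_terms,
        "spam_general": general & spam_terms,
        "ham_general": general & ham_terms,
    }


def construct_features(email_terms, subspaces):
    """Build a 4-dimensional feature vector for one email (assumed forms)."""
    distinct = set(email_terms)

    def term_ratio(subspace):
        # Share of the subspace's terms that occur in the email.
        return len(distinct & subspace) / len(subspace) if subspace else 0.0

    def term_density(subspace):
        # Share of the email's distinct terms that lie in the subspace.
        return len(distinct & subspace) / len(distinct) if distinct else 0.0

    return [
        term_ratio(subspaces["spam_dominant"]),
        term_ratio(subspaces["ham_dominant"]),
        term_density(subspaces["spam_general"]),
        term_density(subspaces["ham_general"]),
    ]


# Toy corpus, invented purely for illustration.
spam_docs = [["free", "money", "now"], ["free", "offer", "now"]]
ham_docs = [["meeting", "now", "agenda"], ["meeting", "notes"]]
subspaces = partition_term_space(spam_docs, ham_docs, dominant_fraction=0.5)
features = construct_features(["free", "money", "meeting", "free"], subspaces)
print(features)
```

Under these assumptions each email is mapped to only four values regardless of vocabulary size, which is one plausible reading of why the abstract reports an efficiency gain over bag-of-words; the resulting vectors could then be passed to any standard classifier.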