An effective and efficient two-stage dimensionality reduction algorithm for content-based spam filtering

Yu Feng, Hongliang Zhou
2013-01-01
Journal of Computational Information Systems
Abstract: Content-based spam filtering is widely used to combat the flood of spam. However, the high dimensionality of the feature space leads to high memory cost and, because of noise, to poor filtering performance. Dimensionality reduction can improve both the efficiency and the effectiveness of classifiers. Traditional dimensionality reduction approaches are typically categorized as feature extraction or feature selection. Although feature extraction is generally more effective, its high computational complexity makes it impractical for content-based spam filtering. Feature selection, by contrast, is widely used in content-based spam filtering because of its efficiency. The mRMR (Minimum Redundancy-Maximum Relevance) criterion was first applied in text classification as a feature selection approach and achieved excellent performance. However, it is seldom used in content-based spam filtering because of its relatively high computational complexity. Therefore, a much more efficient algorithm, OCFS (Orthogonal Centroid Feature Selection), is introduced. By combining OCFS with the mRMR criterion, this paper proposes a new two-stage dimensionality reduction algorithm, OMFS. In the first stage, the OCFS algorithm selects the most representative features from the original high-dimensional feature space. In the second stage, the mRMR criterion is used to further reduce redundancy among the candidate features and obtain the final feature set. Extensive experimental comparisons were performed using three of the most widely used classifiers (Naive Bayes, Support Vector Machine and kNN) on the spam corpus PUf. The experimental results showed that our method led to promising improvements in classification accuracy, F-Measure and ROCA. Copyright © 2013 Binary Information Press.
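The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the code assumes the centroid-based score usually given for OCFS (class-weighted squared distance between each class centroid and the global centroid), a greedy mRMR step in its "relevance minus mean redundancy" form, and discrete (for example binary term-presence) features. The names `ocfs_scores`, `mutual_info`, `omfs_sketch`, `n_candidates` and `n_final` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def ocfs_scores(X, y):
    """Assumed OCFS score per feature: sum_j (n_j / n) * (m_j - m)^2,
    where m_j is the centroid of class j and m is the global centroid."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    global_centroid = X.mean(axis=0)
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        weight = rows.shape[0] / X.shape[0]
        scores += weight * (rows.mean(axis=0) - global_centroid) ** 2
    return scores


def mutual_info(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        pa = np.mean(a == va)
        for vb in np.unique(b):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi


def omfs_sketch(X, y, n_candidates, n_final):
    """Two-stage selection: OCFS-style ranking, then greedy mRMR."""
    X = np.asarray(X)
    y = np.asarray(y)

    # Stage 1: keep the n_candidates features with the highest score.
    order = np.argsort(ocfs_scores(X, y))[::-1]
    candidates = [int(f) for f in order[:n_candidates]]

    # Stage 2: greedily add the candidate that maximizes
    # relevance to the label minus mean redundancy with chosen features.
    relevance = {f: mutual_info(X[:, f], y) for f in candidates}
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n_final:
        def gain(f):
            if not selected:
                return relevance[f]
            redundancy = np.mean(
                [mutual_info(X[:, f], X[:, s]) for s in selected]
            )
            return relevance[f] - redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up for illustration): features 0 and 1 are duplicates,
# feature 2 is informative in a different way, feature 3 is noise.
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.array([
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]).T
chosen = omfs_sketch(X_toy, y_toy, n_candidates=3, n_final=2)
```

On the toy data, the first stage drops the noise feature because its class centroids coincide with the global centroid, and the second stage avoids keeping both duplicated features because the second copy adds redundancy without adding relevance. Running the cheap centroid-based ranking first also means the costlier pairwise mutual-information computations are limited to the `n_candidates` surviving features, which is the efficiency argument the abstract makes.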