Abstract: Content-based spam filtering is widely used to combat the flood of spam. However, a high-dimensional feature space incurs a high memory cost and can degrade filtering performance because of noise. Dimensionality reduction can improve both the efficiency and the effectiveness of classifiers. Traditional dimensionality reduction approaches are typically categorized as feature extraction or feature selection. Although feature extraction is generally more effective, its high computational complexity makes it impractical for content-based spam filtering. Feature selection, by contrast, is widely used in content-based spam filtering because of its efficiency. The mRMR (Minimum Redundancy-Maximum Relevance) criterion was first applied in text classification as a feature selection approach and achieved excellent performance. However, it is seldom used in content-based spam filtering because of its relatively high computational complexity. Therefore, a much more efficient algorithm, OCFS (Orthogonal Centroid Feature Selection), is introduced. By combining OCFS with the mRMR criterion, this paper proposes a new two-stage dimensionality reduction algorithm, OMFS. In the first stage, OCFS selects the most representative features from the original high-dimensional feature space. In the second stage, the mRMR criterion is used to further reduce the redundancy among the candidate features and obtain the final feature set. Extensive experimental comparisons were performed using three of the most widely used classifiers (Naive Bayes, Support Vector Machine and kNN) on spam corpus PUf. The experimental results showed that our method led to promising improvement in classification accuracy, F-Measure and ROCA. Copyright © 2013 Binary Information Press.
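
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the commonly cited OCFS score (the class-weighted squared distance between each class centroid and the global centroid, computed per feature) and a greedy "relevance minus mean redundancy" form of mRMR over discrete features. The function names `ocfs_scores`, `mutual_information` and `two_stage_select`, and the parameters `n_candidates` and `n_final`, are hypothetical.

```python
import numpy as np


def ocfs_scores(X, y):
    """Assumed OCFS score per feature: sum_j (n_j / n) * (m_j - m)^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    global_centroid = X.mean(axis=0)
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        weight = len(rows) / len(X)
        scores += weight * (rows.mean(axis=0) - global_centroid) ** 2
    return scores


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                p_a = np.mean(a == va)
                p_b = np.mean(b == vb)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


def two_stage_select(X, y, n_candidates, n_final):
    """Stage 1: keep the top OCFS features. Stage 2: greedy mRMR on them."""
    X = np.asarray(X)
    y = np.asarray(y)

    # Stage 1: cheap centroid-based ranking shrinks the feature space.
    order = np.argsort(ocfs_scores(X, y))[::-1][:n_candidates]
    candidates = [int(i) for i in order]

    # Stage 2: repeatedly pick the candidate with the best
    # relevance-minus-redundancy trade-off (assumed mRMR difference form).
    relevance = {f: mutual_information(X[:, f], y) for f in candidates}
    selected = []
    while candidates and len(selected) < n_final:
        def mrmr_value(f):
            if not selected:
                return relevance[f]
            redundancy = np.mean(
                [mutual_information(X[:, f], X[:, s]) for s in selected]
            )
            return relevance[f] - redundancy

        best = max(candidates, key=mrmr_value)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Running the cheap OCFS ranking first means the pairwise mutual-information computations in the second stage only touch `n_candidates` features rather than the full vocabulary, which is the efficiency argument the abstract makes.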

Extracting discriminative information from e-mail for spam detection inspired by Immune System

A Local-Concentration-Based Feature Extraction Approach for Spam Filtering.

Large Margin Classification for Combating Disguise Attacks on Spam Filters

Concentration Based Feature Construction Approach for Spam Detection.

An Adaptive Concentration Selection Model for Spam Detection.

Term Space Partition Based Ensemble Feature Construction For Spam Detection

Intelligent Detection Approaches for Spam

Ensemble Decision for Spam Detection Using Term Space Partition Approach

Variable Length Concentration Based Feature Construction Method for Spam Detection

A Late Multi-Modal Fusion Model for Detecting Hybrid Spam E-mail

Feature Construction Approach for Email Categorization Based on Term Space Partition

Detecting Spam E-mails with Content and Weight-based Binomial Logistic Model

Spam Filtering Based on Latent Semantic Indexing

An Adaptive Fusion Algorithm for Spam Detection

Evading obscure communication from spam emails

Combining SVM With Orthogonal Centroid Feature Selection For Spam Filtering

Fusion of text and image features: A new approach to image spam filtering

Application of Natural Language Processing and Machine Learning Boosted with Swarm Intelligence for Spam Email Filtering

An Immune Local Concentration Based Virus Detection Approach

An effective and efficient two-stage Dimensionality reduction algorithm for content-based spam filtering

IFSpard: an Information Fusion-based Framework for Spam Review Detection