Abstract: Content-based spam filtering is widely used to combat the flood of spam. However, a high-dimensional feature space incurs a high memory cost and, because of noisy features, can degrade filtering performance. Dimensionality reduction can improve both the efficiency and the effectiveness of classifiers. Traditional dimensionality reduction approaches are typically categorized as feature extraction or feature selection. Although feature extraction is often more effective, its high computational complexity makes it impractical for content-based spam filtering. Feature selection, by contrast, is widely used in content-based spam filtering because of its efficiency. The mRMR (Minimum Redundancy-Maximum Relevance) criterion was first applied in text classification as a feature selection approach, where it achieved excellent performance. However, it is seldom used in content-based spam filtering because of its relatively high computational complexity. Therefore, a much more efficient algorithm, OCFS (Orthogonal Centroid Feature Selection), is introduced. By combining OCFS with the mRMR criterion, this paper proposes a new two-stage dimensionality reduction algorithm, OMFS. In the first stage, OCFS selects the most representative features from the original high-dimensional feature space. In the second stage, the mRMR criterion is applied to further reduce the redundancy among the candidate features and obtain the final feature set. Extensive experimental comparisons were performed using three of the most widely used classifiers (Naive Bayes, Support Vector Machine and kNN) on the spam corpus PUf. The experimental results showed that our method led to promising improvements in classification accuracy, F-Measure and ROCA. Copyright © 2013 Binary Information Press.
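
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ocfs_scores`, `mutual_info`, `two_stage_select`) and parameters (`n_candidates`, `n_final`) are hypothetical, features are assumed to be discrete (for example binary term presence), the OCFS score is assumed to be the weighted squared distance between class centroids and the global centroid, and the mRMR step is assumed to use the common greedy "relevance minus mean redundancy" form based on mutual information.

```python
import numpy as np

def ocfs_scores(X, y):
    """Assumed OCFS score per feature: sum_j (n_j / n) * (m_j - m)^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall = X.mean(axis=0)                      # global centroid m
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]                      # samples of class j
        scores += (len(rows) / len(X)) * (rows.mean(axis=0) - overall) ** 2
    return scores

def mutual_info(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

def two_stage_select(X, y, n_candidates, n_final):
    """Stage 1: keep the top OCFS features. Stage 2: greedy mRMR over them."""
    X = np.asarray(X)
    y = np.asarray(y)
    order = np.argsort(ocfs_scores(X, y))[::-1]   # highest score first
    candidates = [int(i) for i in order[:n_candidates]]
    relevance = {i: mutual_info(X[:, i], y) for i in candidates}
    selected = []
    while candidates and len(selected) < n_final:
        def mrmr_value(i):
            # Relevance to the label minus mean redundancy with chosen features.
            if not selected:
                return relevance[i]
            redundancy = np.mean([mutual_info(X[:, i], X[:, s]) for s in selected])
            return relevance[i] - redundancy
        best = max(candidates, key=mrmr_value)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In this sketch the inexpensive centroid-based pass shrinks the feature space first, so the pairwise mutual-information computations of the second stage run only over `n_candidates` features rather than the full vocabulary, which is the efficiency argument the abstract makes for combining the two methods.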

WEIGHTED NAIVE BAYES SPAM FILTERING METHOD BASED ON FEATURE TERM DISCRIMINATION

Complex Network Based SMS Filtering Algorithm

Effective spam filter based on a hybrid method of header checking and content parsing

Improving Short Text Classification Through Better Feature Space Selection

A Local-Concentration-Based Feature Extraction Approach for Spam Filtering.

Feature Selection Method on Imbalanced Text

A Spam Filtering Method Based on Bayesian Neural Network

Spam Message Self-Adaptive Filtering System Based on Naive Bayes and Support Vector Machine

Spam message online filtering system based on hash function and naive Bayes

An effective and efficient two-stage Dimensionality reduction algorithm for content-based spam filtering

A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts

Feature Importance Analysis for Spammer Detection in Sina Weibo

Feature Construction Approach for Email Categorization Based on Term Space Partition

A Composite Intelligent Method For Spam Filtering

Classify E-mails by Support Vector Machine

Using modified term frequency to improve term weighting for text classification

Two-step Based Feature Selection Method for Filtering Redundant Information

PRIS Kidult Anti-SPAM Solution at the TREC 2005 Spam Track: Improving the Performance of Naive Bayes for Spam Detection.

Parameter Optimization of Local-Concentration Model for Spam Detection by Using Fireworks Algorithm.

Implementation and Evaluation of Chinese Spam Filtering System

Term Space Partition Based Ensemble Feature Construction For Spam Detection