Training SVM Email Classifiers Using Very Large Imbalanced Dataset
Lili Diao, Chengzhong Yang, Hao Wang
DOI: https://doi.org/10.1080/0952813x.2011.610033
2011-01-01
Journal of Experimental & Theoretical Artificial Intelligence
Abstract: The Internet has been flooded with spam emails, and over the last decade demand for reliable anti-spam email filters has grown. Email filtering can be treated as a classification problem in supervised learning. In theory, many mature techniques, such as support vector machines (SVM), can solve this problem. In real enterprise applications, however, the training data are typically collected via honeypots and are therefore very large and heavily skewed towards spam emails. This challenges both the efficiency and the effectiveness of conventional techniques. In this article, we propose an undersampling method that compresses and balances the training set used for a conventional SVM classifier with minimal information loss. The key observation is that a carefully defined similarity measure between data samples allows a trade-off between training set size and information loss. Our experiments show that the SVM classifier performs better when our compressing and balancing approach is applied.
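The abstract does not specify the similarity measure or the compression procedure, so the following is only a minimal sketch of the general idea of similarity-based undersampling, not the authors' method. It assumes cosine similarity over bag-of-words counts and a greedy pass that drops any spam sample too close to one already kept; the names `cosine`, `compress`, `vectorize`, and the `threshold` parameter are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def compress(samples, threshold):
    """Greedy undersampling (assumed scheme, not the paper's):
    keep a sample only if its similarity to every kept sample
    is below `threshold`."""
    kept = []
    for s in samples:
        if all(cosine(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


def vectorize(text):
    """Toy bag-of-words representation (an assumption for this sketch)."""
    return Counter(text.lower().split())


# Toy majority class (spam) with near-duplicates, as a honeypot might collect.
spam = [vectorize(t) for t in [
    "cheap pills buy now",
    "buy cheap pills now",
    "cheap pills buy now today",
    "you won a lottery prize",
]]

# A lower threshold removes more samples (smaller set, more information loss);
# a higher threshold keeps more (larger set, less loss).
compressed_spam = compress(spam, threshold=0.8)
uncompressed_spam = compress(spam, threshold=1.01)
```

In this sketch the threshold plays the role of the trade-off described above: tightening it shrinks the majority class towards the size of the minority class, and the compressed set would then be passed to an ordinary SVM trainer. The greedy pass is quadratic in the number of kept samples, so a very large corpus would need an indexing or hashing step that is not shown here.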