Adaptive Email Spam Filtering Based on Information Theory

Xin Zhang, Wenyuan Dai, Gui-Rong Xue, Yong Yu
DOI: https://doi.org/10.1007/978-3-540-76993-4_14
2007-01-01
Abstract: Most previous email spam filtering techniques rely on traditional classification learning, which assumes that the training and test data are drawn from the same underlying distribution. In practice, however, this identical-distribution assumption is often violated. Email service providers generally collect training data from various publicly available resources, while the filtering task targets individual users' inboxes. Topics in the mailboxes vary among users, and the distributions shift as a result. In this paper, we propose an adaptive email spam filtering algorithm based on information theory that relaxes the identical-distribution assumption and adapts the knowledge learned from one distribution to another. Our work focuses on content analysis and minimizes the loss in mutual information between email instances and word features before and after classification. We present theoretical and empirical analyses to show that our algorithm solves the adaptive email spam filtering problem well. The experimental results show that our algorithm greatly improves the accuracy of email filtering over traditional classification algorithms, while scaling very well.
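To make the quantity named in the abstract concrete, the following is a minimal sketch of "loss in mutual information" between email instances and word features. It is not the authors' algorithm: the function names, the toy word counts, and the choice to measure loss as I(emails; words) minus I(classes; words) are all assumptions made for illustration only.

```python
import math

def mutual_information(counts):
    """Mutual information I(X;Y) in bits from a joint count table (list of rows)."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:
                # p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi

def merge_rows(counts, labels):
    """Sum the word counts of all emails that share a class label."""
    merged = {}
    for row, label in zip(counts, labels):
        acc = merged.setdefault(label, [0] * len(row))
        for j, n in enumerate(row):
            acc[j] += n
    return [merged[key] for key in sorted(merged)]

def mi_loss(counts, labels):
    """Mutual information lost when emails are replaced by their class labels."""
    return mutual_information(counts) - mutual_information(merge_rows(counts, labels))

# Hypothetical toy data: rows are emails, columns are counts of three words.
emails = [[3, 0, 2], [2, 0, 3], [0, 4, 1], [0, 5, 0]]
sensible = ["spam", "spam", "ham", "ham"]   # groups emails with similar words
arbitrary = ["spam", "ham", "spam", "ham"]  # mixes dissimilar emails

loss_sensible = mi_loss(emails, sensible)
loss_arbitrary = mi_loss(emails, arbitrary)
```

In this toy setting, a labeling that groups emails with similar word usage loses less mutual information than one that mixes them, which is the intuition behind using this loss as a criterion for classification.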