Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

F. Janez-Martino, R. Alaiz-Rodriguez, V. Gonzalez-Castro, E. Fidalgo, E. Alegre
2024-02-08
Abstract: Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies, which address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach to classify spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF) and Logistic Regression (LR). Experimental results show that the highest performance on the English dataset is achieved with TF-IDF and LR, with an F1 score of 0.953 and an accuracy of 94.6%, while on the Spanish dataset TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, TF-IDF with LR gives the fastest average per-email classification for both English and Spanish spam.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
Most current research on spam focuses on designing efficient anti-spam filters, which usually rely on binary classification to separate legitimate emails from spam. Starting instead from the needs of cybersecurity units, this paper proposes a topic-based method that classifies spam into multiple categories and so identifies the type of spam more precisely. Rather than only deciding whether an email is spam, the method also determines its specific category, such as malware, phishing or fraud, which helps to prevent cyberattacks and campaigns aimed at specific targets. In this way, cybersecurity incidents can be handled better, companies and citizens can be protected, and early warnings can be issued in time. The main contributions of the paper are:
1. Using a hierarchical clustering algorithm to analyze the text of spam emails and divide them into categories based on cybersecurity topics.
2. Proposing an email pre-processing method for extracting the text content of spam emails, taking into account two techniques commonly used by spammers: (i) embedding part (or all) of the spam message in an image; (ii) hiding random text (called "salting") in the body of the email.
3. Creating a new dataset, called Spam Email Multiclassification (SPEMC), which is divided into two subsets: one of English spam emails and one of Spanish spam emails. Each subset contains approximately 15,000 spam emails, each labeled with one of 11 predefined categories.
4. Introducing a framework that uses machine learning and natural language processing techniques to classify spam into cybersecurity categories.
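The clustering step in the first contribution can be illustrated with a small sketch. It is not the authors' implementation: the toy messages, whitespace tokenization, TF-IDF weighting, cosine distance and average linkage are all assumptions made for illustration, since this summary does not state which linkage or distance the paper uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse, L2-normalized TF-IDF dict per document (smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalized sparse vectors."""
    return 1.0 - sum(w * b.get(term, 0.0) for term, w in a.items())


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]
    dist = [[cosine_distance(a, b) for b in vectors] for a in vectors]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[p][q] for p in clusters[i] for q in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the closest pair
        del clusters[j]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for member in members:
            labels[member] = label
    return labels


# Toy messages (invented for this example, not taken from SPEMC).
emails = [
    "verify your bank account password now",
    "your bank account is suspended verify password",
    "cheap pills online pharmacy discount",
    "discount pharmacy pills buy online",
]
labels = agglomerative(tfidf_vectors(emails), n_clusters=2)
print(labels)
```

On this toy input the two account-verification messages fall into one cluster and the two pharmacy messages into the other. In a labeling workflow such as the one the paper describes, each resulting cluster would then be inspected and mapped to a topic category.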
The proposed classification framework can be integrated into tools and services for citizens and organizations, helping them to identify harmful spam such as hacker ransom demands, false rewards, identity fraud or false job opportunities. In addition, to extract all valuable text from spam emails, the paper detects the two common spammer techniques in the dataset and proposes ways to reduce their impact on classification. For emails that embed text in images, the text is extracted with optical character recognition (OCR) instead of being ignored. For hidden salting text, the email is not analyzed through its HTML tags; OCR is used instead to extract only the text that is visible to the user. These measures help to improve the accuracy and efficiency of multi-class spam classification.
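The classification framework pairs a text representation with a classifier. The following minimal sketch shows one such pairing, Bag of Words with multinomial Naïve Bayes, written with the standard library only. The training messages, the category names (`phishing`, `fake_reward`, `job_scam`), the whitespace tokenizer and the smoothing value are hypothetical choices for illustration; they are not the SPEMC labels or the authors' code.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on bag-of-words counts with add-alpha smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(texts))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_nb(model, text):
    """Return the class with the highest log-posterior; unseen words are skipped."""
    tokens = text.lower().split()
    scores = {
        label: log_prior + sum(ll[t] for t in tokens if t in ll)
        for label, (log_prior, ll) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training data: two toy messages per invented category.
train_texts = [
    "verify your bank account password",
    "confirm your account login details",
    "you won a prize claim your reward",
    "claim your free lottery prize today",
    "work from home job offer apply",
    "earn money easy job apply today",
]
train_labels = [
    "phishing", "phishing",
    "fake_reward", "fake_reward",
    "job_scam", "job_scam",
]

model = train_nb(train_texts, train_labels)
prediction = predict_nb(model, "claim your lottery reward")
print(prediction)
```

The other pipelines evaluated in the paper follow the same two-stage shape, with a different representation (TF-IDF, Word2Vec, BERT) or classifier (SVM, RF, LR) in place of the parts shown here.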