Abstract: Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies, which address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach to classify spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT, and four classifiers: Support Vector Machine, Naïve Bayes (NB), Random Forest and Logistic Regression (LR). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and Spanish spam email in and on average, respectively.
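To make the idea of a "pipeline" concrete, the following is a minimal sketch of one representation/classifier pairing named in the abstract: TF-IDF features fed to a multinomial Naïve Bayes classifier. It is written with the Python standard library only and is not the authors' implementation. The whitespace tokenizer, the smoothed IDF formula, the function names (`fit_idf`, `tfidf`, `train_nb`, `predict`), the three class labels and the toy emails are all assumptions made for illustration; the paper's datasets use 11 classes that are not listed here.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase + whitespace split; a real pipeline cleans far more.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for each training term."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """Return {term: term_count * idf}; terms unseen in training are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def train_nb(docs, labels, idf, alpha=1.0):
    """Multinomial Naive Bayes on TF-IDF weights with additive smoothing."""
    weights = defaultdict(Counter)  # class -> term -> summed TF-IDF weight
    for doc, y in zip(docs, labels):
        weights[y].update(tfidf(doc, idf))
    class_counts = Counter(labels)
    model = {}
    for y, w in weights.items():
        total = sum(w.values()) + alpha * len(idf)
        log_prior = math.log(class_counts[y] / len(labels))
        log_lik = {t: math.log((w[t] + alpha) / total) for t in idf}
        model[y] = (log_prior, log_lik)
    return model

def predict(doc, idf, model):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    vec = tfidf(doc, idf)
    def score(y):
        log_prior, log_lik = model[y]
        return log_prior + sum(v * log_lik[t] for t, v in vec.items())
    return max(model, key=score)

# Hypothetical toy data; labels are illustrative, not the paper's classes.
train_docs = [
    "verify your bank account password now",
    "your account is suspended confirm your password",
    "you won a free prize claim your reward",
    "claim your free gift card reward today",
    "work from home job offer earn money",
    "new job opportunity apply now earn weekly",
]
train_labels = ["phishing", "phishing", "fake_reward",
                "fake_reward", "job_offer", "job_offer"]

idf = fit_idf(train_docs)
model = train_nb(train_docs, train_labels, idf)

for email in ["confirm your bank password",
              "earn money with this job",
              "claim your free reward"]:
    print(email, "->", predict(email, idf, model))
```

Swapping the classifier or the representation while keeping the same train/predict interface is what yields the grid of pipelines the abstract describes; in practice a library such as scikit-learn would likely replace the hand-written parts.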

What problem does this paper attempt to address?

Most current research on spam focuses on designing efficient anti-spam filters, usually as a binary classification task that separates legitimate email from spam. Starting instead from the needs of cybersecurity units, this paper proposes a topic-based method that classifies spam into multiple categories. The method therefore goes beyond deciding whether an email is spam: it also determines the specific category, such as malware, phishing or fraud, which matters for preventing cyber-attacks or activities aimed at specific targets. This allows cybersecurity incidents to be handled better, helps protect enterprises and citizens, and supports timely early warnings.

The main contributions of the paper are:

1. Using a hierarchical clustering algorithm to analyze the text of spam emails and divide them into categories based on cybersecurity topics.
2. Proposing an email pre-processing method for extracting the text content of spam emails that accounts for two techniques commonly used by spammers: (i) embedding part or all of the spam message in an image; (ii) hiding random text (called "salting") in the body of the email.
3. Creating a new dataset, Spam Email Multiclassification (SPEMC), divided into two subsets, one of English and one of Spanish spam emails. Each subset contains approximately 15,000 spam emails labeled with one of 11 predefined categories.
4. Introducing a framework that uses machine learning and natural language processing techniques to classify spam into cybersecurity categories. This framework can be integrated into tools and services that help citizens and organizations identify harmful spam, such as ransom hacking, false rewards, identity fraud or false job opportunities.

To extract all valuable text from spam emails, the paper also detects the two spammer techniques listed in contribution 2 in the dataset and proposes ways to reduce their impact on classification. For spam emails that contain images, it uses OCR to extract the text instead of ignoring it. For hidden salting text, it uses OCR to extract the user-visible text instead of inspecting HTML tags. These methods help to improve the accuracy and efficiency of multi-class spam classification.
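The clustering step used for labeling can be illustrated with a small sketch of agglomerative (bottom-up) hierarchical clustering. This is not the paper's configuration: the raw term-count vectors, the cosine distance, the average linkage, the helper names (`vectorize`, `cosine_distance`, `agglomerative`) and the toy emails are all assumptions chosen to keep the example short and dependency-free.

```python
import math
from collections import Counter

def vectorize(text):
    # Assumption: raw term counts over whitespace tokens.
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def agglomerative(texts, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    vecs = [vectorize(t) for t in texts]
    dist = [[cosine_distance(a, b) for b in vecs] for a in vecs]
    clusters = [[i] for i in range(len(texts))]  # start: one cluster per email

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]

    labels = [0] * len(texts)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Hypothetical toy emails covering three topics.
emails = [
    "verify your bank account password",
    "confirm your bank account password now",
    "claim your free prize reward",
    "free prize reward waiting claim today",
    "job offer work from home",
    "work from home job apply today",
]
labels = agglomerative(emails, n_clusters=3)
print(labels)
```

In this sketch the number of clusters is fixed in advance. The resulting cluster ids are arbitrary integers, so a person would still need to inspect each cluster and assign it a meaningful topic name before the labels can train a supervised classifier.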

Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach
