Building an Effective Email Spam Classification Model with spaCy

Kazem Taghandiki
DOI: https://doi.org/10.48550/arXiv.2303.08792
2023-03-16
Abstract: Today, people use email services such as Gmail, Outlook, and AOL Mail to exchange information and official correspondence as quickly as possible. Spam, or junk mail, is a major challenge to this type of communication. It is usually sent in bulk by botnets with the aim of advertising, causing harm, or stealing information. Receiving unwanted spam emails on a daily basis fills up the inbox. Spam detection is therefore a fundamental challenge, and many works have addressed it using clustering and text categorisation methods. In this article, the author uses the spaCy natural language processing library and three machine learning (ML) algorithms, Naive Bayes (NB), Decision Tree C4.5, and Multilayer Perceptron (MLP), in the Python programming language to detect spam emails collected from the Gmail service. Observations show that the Multilayer Perceptron (MLP) algorithm achieves an accuracy of 96% in spam detection.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the detection of spam (junk mail) in email. Specifically, the author aims to build an effective email spam classification model using the natural language processing library spaCy and three machine-learning algorithms: Naive Bayes, the C4.5 decision tree, and the multilayer perceptron (MLP). Spam is usually sent by botnets for the purposes of advertising, damaging systems, or stealing information. These messages not only take up space in users' inboxes but may also pose security threats. An efficient and accurate spam detection system therefore matters for both the security and the efficiency of email communication.

The author first collected 1,500 emails from the Gmail service, 750 spam and 750 legitimate, as a data set. The text was then pre-processed with spaCy: stop words and numbers were removed, and the text was normalized and stemmed. Finally, the three machine-learning algorithms were trained, and the resulting models were evaluated on test data. The MLP algorithm reached an accuracy of 96% in spam detection.
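The pipeline described above (pre-process, train, evaluate accuracy) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation. To stay dependency-free it replaces spaCy's pre-processing with a regular expression and a tiny hand-written stop-word list, it performs no stemming, and it shows only a from-scratch multinomial Naive Bayes classifier, not C4.5 or the MLP. The names `STOP_WORDS`, `preprocess`, and `NaiveBayes`, as well as the eight example messages, are invented for the illustration and do not come from the paper's Gmail data set.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (spaCy's English list is far larger).
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "are", "you", "your",
              "for", "in", "on", "at", "we", "our", "this", "now"}


def preprocess(text):
    """Lowercase, keep alphabetic runs only (this drops numbers), remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)  # documents per class, for the priors
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            counts = self.word_counts[c]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            scores[c] = score
        return max(scores, key=scores.get)


# Toy data invented for this sketch; "ham" means a legitimate message.
train = [
    ("Win a free prize now, click here", "spam"),
    ("Cheap loans approved, claim your 1000 dollars bonus", "spam"),
    ("Free bonus offer, click to claim", "spam"),
    ("Meeting agenda for the project review on Monday", "ham"),
    ("Please find the attached report for the budget meeting", "ham"),
    ("Lunch on Friday to discuss the project schedule", "ham"),
]
test = [
    ("Claim your free prize, click here", "spam"),
    ("Agenda and report for the Monday meeting", "ham"),
]

model = NaiveBayes().fit([preprocess(t) for t, _ in train], [y for _, y in train])
predictions = [model.predict(preprocess(t)) for t, _ in test]

# Accuracy = correctly classified test messages / all test messages.
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(f"accuracy = {accuracy:.2f}")
```

The accuracy on this toy split says nothing about the paper's reported 96%, which was measured for the MLP on the Gmail data set. In a setup closer to the paper, `preprocess` would presumably be replaced by spaCy tokenization and filtering, and the classifier by library implementations of the three algorithms.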