Abstract: Advances in technology have left a mark on every field of life, including medicine, music, office work, travel, and communication. In the past, wired telephone lines were the main communication medium. Today, wireless technology has largely displaced them and offers a much broader set of features. Advertising agencies and spammers mostly use SMS to deliver their promotional material to ordinary users. As a result, more than 60% of the SMS messages received daily are spam. These messages annoy users and sometimes defraud them, while generating large profits for spammers and advertising companies. This study proposes an approach for classifying SMS messages as spam or ham using supervised machine learning techniques. Features are extracted from the text using Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words. The SMS dataset used was imbalanced, and we addressed this with over-sampling and under-sampling techniques. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression were applied to the spam and ham SMS dataset, and their performance was evaluated using accuracy, precision, recall, and F1 score. The experimental results show that random forest classifies spam and ham SMS more accurately than the other classifiers, with 99% accuracy. With TF-IDF features and over-sampling, the proposed model reliably identifies whether an SMS is ham or spam. The proposed approach was also evaluated on a spam email dataset, where it likewise achieved 99% accuracy.
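
The feature extraction, class balancing, and evaluation steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`tfidf`, `random_oversample`, `binary_metrics`), the smoothed IDF variant, whitespace tokenization, and the toy messages are all assumptions made for illustration, and the classifiers themselves (for example random forest) are left out.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF, one common variant; the abstract does not state the exact formula.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty messages
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall, and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, made-up messages: one spam and three ham, i.e. an imbalanced set.
messages = [
    "win a free prize now",
    "are we meeting today",
    "see you at lunch",
    "call me when you are home",
]
labels = ["spam", "ham", "ham", "ham"]

vocab, X = tfidf(messages)
X_bal, y_bal = random_oversample(X, labels)
```

After over-sampling, `y_bal` contains as many spam rows as ham rows, so a classifier trained on `X_bal` no longer sees a skewed class distribution. Over-sampling should be applied to the training split only; balancing the test split would distort the reported metrics.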
