Abstract: The growing problem of unsolicited text messages (smishing) and data irregularities calls for stronger spam detection. This paper develops a model that identifies smishing messages by capturing the relationships among words, images, and context-specific factors, areas that remain underexplored in existing research. To this end, we merge the UCI spam dataset of regular text messages with real-world spam data and use OCR to bring image-based messages into the analysis. The study combines traditional machine learning models (K-means, Non-Negative Matrix Factorization, and Gaussian Mixture Models) with feature extraction techniques such as TF-IDF and PCA, and additionally uses deep learning models such as RNN-Flatten, LSTM, and Bi-LSTM. These models were selected for their complementary strengths in capturing the linear and non-linear relationships in smishing messages: the machine learning models handle structured text data efficiently, while the deep learning models are better at capturing sequential dependencies and contextual nuances. All models are evaluated on accuracy, precision, recall, and F1 score, which allows a direct comparison between the machine learning and deep learning approaches. Notably, K-means feature extraction with a vectorizer achieved 91.01% accuracy, and the KNN-Flatten model reached 94.13% accuracy, making it the top performer. We highlight these two models for their potential to improve smishing detection rates, although each has a limitation. The high accuracy of the KNN-Flatten model suggests it could be applied in real-time spam detection systems, but its computational complexity may limit scalability in large deployments. Similarly, while K-means with a vectorizer achieves high accuracy, it may struggle with the dynamic, evolving nature of smishing attacks and therefore require continual retraining.
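As a reading aid for the four evaluation metrics named above, the following minimal sketch computes accuracy, precision, recall, and F1 score from binary labels. It is an illustration under stated assumptions, not the evaluation code used in the study: the function name `binary_metrics`, the label encoding (1 = smishing, 0 = legitimate), and the toy labels are all hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier.

    Assumption: `positive` marks the smishing class (hypothetical encoding).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of messages flagged as smishing that really are smishing.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual smishing messages that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented purely for illustration (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters for spam data because the classes are usually imbalanced: a model that labels every message as legitimate can still score high accuracy while having zero recall on smishing.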
