Abstract: As cyber-attacks grow in speed and complexity, the cybersecurity industry is challenged to apply state-of-the-art technology and strategies against persistent malicious threats. Phishing is a technically mounted social engineering attack, classified as identity theft, that uses sophisticated attack vectors to steal information from internet users. The main objective of this study is to propose a unique, robust ensemble machine learning model architecture that provides the highest prediction accuracy with a low error rate, while also proposing a few other robust machine learning models. Both supervised and unsupervised techniques were used for the detection process. The experiments used seven classification algorithms, one clustering algorithm, two ensemble techniques, and two large standard legitimate datasets with 73,575 URLs and 100,000 URLs. Two test modes (percentage split and K-fold cross-validation) were used for conducting experiments and final predictions. Mechanisms were developed to (I) identify, for each classifier, the best $N$, which is the optimal heuristic-based threshold value for splitting words into subwords, (II) tune hyperparameters for each classifier to find the best parameter combination, (III) select prominent features using various feature selection techniques, (IV) propose a robust ensemble model (classifier) called the Expandable Random Gradient Stacked Voting Classifier (ERG-SVC), which uses a voting classifier along with a model architecture, (V) analyze possible clusters of the dataset using k-means clustering, (VI) thoroughly analyze the gradient boosting (GB) classifier with respect to its “criterion” parameter, using the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Friedman_MSE, and (VII) propose a lightweight preprocessor to reduce computational cost and preprocessing time.
Initial experiments were carried out with 46 features; the number of features was reduced to 22 after the experiments. The results show that the GB classifier outperformed the others with the fewest NLP-based features, achieving a prediction accuracy of 98.118%. Furthermore, our stacking ensemble model and the proposed voting ensemble model (ERG-SVC) outperformed the other tested approaches and yielded reliable prediction accuracy in detecting malicious URLs, at 98.23% and 98.27%, respectively.
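To make the voting idea behind an ensemble such as ERG-SVC concrete, the following is a minimal sketch of hard (majority) voting over the label predictions of several base classifiers. It is not the ERG-SVC architecture itself, whose details are not given in this abstract. The function names `majority_vote` and `accuracy`, the three models, and all prediction values are hypothetical and chosen only for illustration (1 = phishing, 0 = legitimate).

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Combine label predictions from several models by hard voting.

    per_model_predictions: list of equal-length lists, one per base model.
    Returns a list with one combined label per sample.
    """
    if not per_model_predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(per_model_predictions[0])
    if any(len(p) != n_samples for p in per_model_predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for sample_votes in zip(*per_model_predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and base-model outputs (illustrative only, not paper data).
y_true = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]
model_b = [1, 1, 1, 1, 0]
model_c = [0, 0, 1, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print("ensemble predictions:", ensemble)
print("ensemble accuracy:", accuracy(y_true, ensemble))
```

In this toy case each base model makes at least one error, but the errors fall on different samples, so the majority label is correct everywhere. With an odd number of binary voters no ties can occur; a real system would also need a tie-breaking rule, or soft voting over predicted probabilities.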

Robust Ensemble Machine Learning Model for Filtering Phishing URLs: Expandable Random Gradient Stacked Voting Classifier (ERG-SVC)
