Enhancing Phishing Detection through Feature Importance Analysis and Explainable AI: A Comparative Study of CatBoost, XGBoost, and EBM Models

Abdullah Fajar, Setiadi Yazid, Indra Budi

2024-11-11

Abstract: Phishing attacks remain a persistent threat to online security, demanding robust detection methods. This study investigates the use of machine learning to identify phishing URLs, emphasizing the crucial role of feature selection and model interpretability for improved performance. Employing Recursive Feature Elimination, the research pinpointed key features like "length_url," "time_domain_activation," and "Page_rank" as strong indicators of phishing attempts. The study evaluated various algorithms, including CatBoost, XGBoost, and Explainable Boosting Machine, assessing their robustness and scalability. XGBoost emerged as highly efficient in terms of runtime, making it well-suited for large datasets. CatBoost, on the other hand, demonstrated resilience by maintaining high accuracy even with reduced features. To enhance transparency and trustworthiness, Explainable AI techniques, such as SHAP, were employed to provide insights into feature importance. The study's findings highlight that effective feature selection and model interpretability can significantly bolster phishing detection systems, paving the way for more efficient and adaptable defenses against evolving cyber threats.
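To make the Recursive Feature Elimination step concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the data are synthetic, the two noise columns ("noise_a", "noise_b") are invented for illustration, and a small logistic regression stands in for the boosted-tree models the paper evaluates. Only the loop structure (fit, rank features by importance, drop the weakest, repeat) is meant to carry over.

```python
import math
import random


def make_synthetic_urls(n=400, seed=0):
    """Build a toy data set (synthetic, NOT the paper's data)."""
    rng = random.Random(seed)
    names = ["length_url", "time_domain_activation", "page_rank",
             "noise_a", "noise_b"]  # the noise columns are hypothetical
    rows, labels = [], []
    for i in range(n):
        y = i % 2  # 1 = phishing, 0 = legitimate
        rows.append([
            rng.gauss(2.0 * y, 1.0),   # assumption: phishing URLs are longer
            rng.gauss(-2.0 * y, 1.0),  # assumption: phishing domains are younger
            rng.gauss(-2.0 * y, 1.0),  # assumption: phishing pages rank lower
            rng.gauss(0.0, 1.0),       # pure noise
            rng.gauss(0.0, 1.0),       # pure noise
        ])
        labels.append(y)
    return rows, labels, names


def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent; return the weights."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            gb += err
            for j in range(d):
                gw[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w


def recursive_feature_elimination(rows, labels, names, keep):
    """Refit on the surviving features and drop the weakest until `keep` remain."""
    active = list(range(len(names)))
    eliminated = []
    while len(active) > keep:
        subset = [[r[j] for j in active] for r in rows]
        weights = fit_logistic(subset, labels)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        eliminated.append(names[active.pop(weakest)])
    return [names[j] for j in active], eliminated


rows, labels, names = make_synthetic_urls()
selected, eliminated = recursive_feature_elimination(
    standardize(rows), labels, names, keep=3)
print("kept:", selected)
print("dropped:", eliminated)
```

Because the weights are compared by magnitude, the features are standardized first; with tree-based models such as CatBoost or XGBoost, the ranking would instead come from the model's own importance scores.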

Cryptography and Security, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the persistent threat that phishing attacks pose to network security, and in particular the weakness of traditional signature-based detection in identifying newly created phishing websites (zero-day attacks). The research therefore develops machine-learning-based methods that detect phishing websites accurately and efficiently by selecting the most relevant features from large, complex datasets. It sets out to answer three questions:

1. How can feature selection reduce the number of features while improving the efficiency and accuracy of a machine-learning model for phishing detection?
2. Which machine-learning algorithms perform best at phishing detection when combined with effective feature selection?
3. How can Explainable AI (XAI) methods identify the most influential features for phishing detection and clarify how those features affect model predictions?

To answer these questions, the study uses the following methods:

- Feature selection: Recursive Feature Elimination (RFE) identifies key features such as URL length ("length_url"), domain activation time ("time_domain_activation"), and page rank ("Page_rank").
- Machine-learning algorithm evaluation: CatBoost, XGBoost, and Explainable Boosting Machine (EBM) are compared for robustness and scalability.
- Model interpretability: XAI techniques such as SHAP (SHapley Additive exPlanations) provide insight into feature importance, improving transparency and trustworthiness.

Together, these methods are intended to improve the performance of phishing detection systems and to support more effective, adaptable defenses against evolving cyber threats.
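The idea behind SHAP-style feature attribution can be illustrated with exact Shapley values on a tiny model. The sketch below is an assumption-laden toy, not the paper's setup: `toy_score` is a hypothetical scoring function with invented coefficients and thresholds, "absent" features are replaced by a fixed baseline, and every feature subset is enumerated, which is feasible only for a handful of features. The SHAP library relies on faster model-specific or sampling-based algorithms instead.

```python
from itertools import combinations
from math import factorial


def toy_score(f):
    """Hypothetical phishing score; coefficients and thresholds are invented."""
    score = 0.01 * f["length_url"] - 0.2 * f["page_rank"]
    # Interaction: a very young domain combined with a long URL is extra suspicious.
    if f["time_domain_activation"] < 30 and f["length_url"] > 75:
        score += 0.5
    return score


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    names = list(x)
    n = len(names)
    phi = {}
    for j in names:
        others = [k for k in names if k != j]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = {k: (x[k] if k in subset else baseline[k])
                           for k in names}
                with_j = dict(without, **{j: x[j]})
                total += weight * (predict(with_j) - predict(without))
        phi[j] = total
    return phi


# One suspicious URL compared against an assumed "typical legitimate" baseline.
x = {"length_url": 120, "time_domain_activation": 10, "page_rank": 1}
baseline = {"length_url": 40, "time_domain_activation": 2000, "page_rank": 5}

phi = shapley_values(toy_score, x, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the score for `x` and the score for the baseline, which is the additivity property that makes Shapley-based explanations easy to read: each feature receives a share of the prediction shift, and interaction effects are split among the features involved.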

Comparative Study of CatBoost, XGBoost, and LightGBM for Enhanced URL Phishing Detection: A Performance Assessment

Improving Phishing Website Detection Using a Hybrid Two-level Framework for Feature Selection and XGBoost Tuning

Analysis of the Performance Impact of Fine-Tuned Machine Learning Model for Phishing URL Detection

Can Features for Phishing URL Detection Be Trusted Across Diverse Datasets? A Case Study with Explainable AI

Comparative Analysis of Black-Box and White-Box Machine Learning Model in Phishing Detection

Light gradient boosting machine-based phishing webpage detection model using phisher website features of mimic URLs

An effective detection approach for phishing websites using URL and HTML features

Comparative evaluation of machine learning algorithms for phishing site detection

PhishGuard: A Multi-Layered Ensemble Model for Optimal Phishing Website Detection

Phishing website detection: How effective are deep learning‐based models and hyperparameter optimization?

Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages

AI Meta-Learners and Extra-Trees Algorithm for the Detection of Phishing Websites

Investigation of Phishing Susceptibility with Explainable Artificial Intelligence

Improving Phishing Email Detection Using the Hybrid Machine Learning Approach

Optimized URL Feature Selection Based on Genetic-Algorithm-Embedded Deep Learning for Phishing Website Detection

Mitigating Bias in Machine Learning Models for Phishing Webpage Detection

Robust Ensemble Machine Learning Model for Filtering Phishing URLs: Expandable Random Gradient Stacked Voting Classifier (ERG-SVC)

Phishing website detection using support vector machines and nature-inspired optimization algorithms

Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection

PhishGuard: A Convolutional Neural Network Based Model for Detecting Phishing URLs with Explainability Analysis