Enhancing Phishing Detection through Feature Importance Analysis and Explainable AI: A Comparative Study of CatBoost, XGBoost, and EBM Models

Abdullah Fajar, Setiadi Yazid, Indra Budi
2024-11-11
Abstract: Phishing attacks remain a persistent threat to online security, demanding robust detection methods. This study investigates the use of machine learning to identify phishing URLs, emphasizing the crucial role of feature selection and model interpretability for improved performance. Employing Recursive Feature Elimination, the research pinpointed key features like "length_url", "time_domain_activation", and "Page_rank" as strong indicators of phishing attempts. The study evaluated various algorithms, including CatBoost, XGBoost, and Explainable Boosting Machine, assessing their robustness and scalability. XGBoost emerged as highly efficient in terms of runtime, making it well-suited for large datasets. CatBoost, on the other hand, demonstrated resilience by maintaining high accuracy even with reduced features. To enhance transparency and trustworthiness, Explainable AI techniques, such as SHAP, were employed to provide insights into feature importance. The study's findings highlight that effective feature selection and model interpretability can significantly bolster phishing detection systems, paving the way for more efficient and adaptable defenses against evolving cyber threats.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the persistent threat that phishing attacks pose to network security, and in particular the weakness of traditional signature-based detection in identifying newly created phishing websites (zero-day attacks). The research therefore focuses on machine-learning-based methods that detect phishing websites accurately and efficiently by selecting the most relevant features from large, complex datasets.

Specifically, the research aims to answer the following key questions:
1. How can feature selection reduce the number of features while improving the efficiency and accuracy of a machine-learning model in detecting phishing websites?
2. Which machine-learning algorithms perform best in phishing detection when combined with effective feature selection techniques?
3. How can Explainable AI (XAI) methods identify the most influential features for phishing detection and clarify how those features affect model predictions?

To achieve these goals, the research adopts the following main methods and techniques:
- Feature selection: Recursive Feature Elimination (RFE) is used to identify key features such as URL length ("length_url"), domain activation time ("time_domain_activation"), and page rank ("Page_rank").
- Machine-learning algorithm evaluation: The robustness and scalability of CatBoost, XGBoost, and Explainable Boosting Machine (EBM) are compared.
- Model interpretability: XAI techniques such as SHAP (SHapley Additive exPlanations) are used to provide insight into feature importance, enhancing transparency and credibility.

Through these methods, the research aims to improve the performance of phishing detection systems and to provide more effective and adaptable defenses against evolving network threats.
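The Recursive Feature Elimination step can be illustrated with a small sketch. It is not the authors' pipeline: the data is synthetic, a plain logistic regression stands in for the boosted-tree models, and the features "noise_a" and "noise_b" are hypothetical fillers added only so that there is something to eliminate. The assumed relationship between the three named features and the label is also invented for illustration.

```python
import math
import random

# Three feature names come from the paper; noise_a / noise_b are hypothetical.
FEATURES = ["length_url", "time_domain_activation", "Page_rank",
            "noise_a", "noise_b"]


def make_synthetic_urls(n=400, seed=0):
    """Generate toy, already-standardized data (NOT the paper's dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        sign = 1.0 if label else -1.0
        X.append([
            rng.gauss(sign, 1.0),    # assumed: phishing URLs tend to be longer
            rng.gauss(-sign, 1.0),   # assumed: phishing domains are younger
            rng.gauss(-sign, 1.0),   # assumed: phishing pages rank lower
            rng.gauss(0.0, 1.0),     # pure noise
            rng.gauss(0.0, 1.0),     # pure noise
        ])
        y.append(label)
    return X, y


def sigmoid(z):
    z = max(-30.0, min(30.0, z))     # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, cols, epochs=200, lr=0.1):
    """Full-batch gradient descent on log loss, using only columns `cols`."""
    w, b, n = [0.0] * len(cols), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(cols), 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * row[c] for wj, c in zip(w, cols))
            err = sigmoid(z) - target
            for j, c in enumerate(cols):
                grad_w[j] += err * row[c]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(X, y, cols, w, b):
    hits = 0
    for row, target in zip(X, y):
        z = b + sum(wj * row[c] for wj, c in zip(w, cols))
        hits += int((sigmoid(z) >= 0.5) == bool(target))
    return hits / len(y)


def rfe(X, y, names, n_keep):
    """Refit, drop the feature with the smallest |weight|, and repeat."""
    cols = list(range(len(names)))
    dropped = []
    while len(cols) > n_keep:
        w, _ = fit_logistic(X, y, cols)
        worst = min(range(len(cols)), key=lambda j: abs(w[j]))
        dropped.append(names[cols.pop(worst)])
    return [names[c] for c in cols], dropped


if __name__ == "__main__":
    X, y = make_synthetic_urls()
    kept, dropped = rfe(X, y, FEATURES, n_keep=3)
    kept_cols = [FEATURES.index(name) for name in kept]
    w, b = fit_logistic(X, y, kept_cols)
    print("kept:", kept)
    print("dropped:", dropped)
    print("training accuracy on kept features:", accuracy(X, y, kept_cols, w, b))
```

Ranking by absolute weight is only meaningful here because the toy features share a common scale. With tree ensembles such as CatBoost or XGBoost, the same loop would typically rank by the model's own feature importances, and scikit-learn's `sklearn.feature_selection.RFE` wraps this refit-and-drop loop around any estimator that exposes `coef_` or `feature_importances_`.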