AntiPhishStack: LSTM-based Stacked Generalization Model for Optimized Phishing URL Detection

Saba Aslam, Hafsa Aslam, Arslan Manzoor, Chen Hui, Abdur Rasool
2024-01-21
Abstract: The escalating reliance on revolutionary online web services has introduced heightened security risks, with persistent challenges posed by phishing despite extensive security measures. Traditional phishing systems, reliant on machine learning and manual features, struggle with evolving tactics. Recent advances in deep learning offer promising avenues for tackling novel phishing challenges and malicious URLs. This paper introduces a two-phase stack generalized model named AntiPhishStack, designed to detect phishing sites. The model leverages the learning of URLs and character-level TF-IDF features symmetrically, enhancing its ability to combat emerging phishing threats. In Phase I, features are trained on a base machine learning classifier, employing K-fold cross-validation for robust mean prediction. Phase II employs a two-layered stacked-based LSTM network with five adaptive optimizers for dynamic compilation, ensuring premier prediction on these features. Additionally, the symmetrical predictions from both phases are optimized and integrated to train a meta-XGBoost classifier, contributing to a final robust prediction. The significance of this work lies in advancing phishing detection with AntiPhishStack, operating without prior phishing-specific feature knowledge. Experimental validation on two benchmark datasets, comprising benign and phishing or malicious URLs, demonstrates the model's exceptional performance, achieving a notable 96.04% accuracy compared to existing studies. This research adds value to the ongoing discourse on symmetry and asymmetry in information security and provides a forward-thinking solution for enhancing network security in the face of evolving cyber threats.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the limitations of current phishing detection systems. Traditional systems built on machine learning and manually extracted features struggle to keep up with constantly evolving phishing strategies, and they perform especially poorly on new types of phishing websites. Their reliance on predefined features also limits how well they generalize and adapt to new threats.

To solve these problems, the paper proposes a two-stage stacked generalization model named **AntiPhishStack**, which aims to optimize the detection of phishing URLs. The main contributions and goals of this model include:

1. **No prior feature knowledge required**: The AntiPhishStack model learns URL and character-level TF-IDF features automatically, without predefined phishing features, which improves the flexibility and adaptability of detection.
2. **Strong generalization ability**: By using character-level features and combining high-level and low-level features in the hidden layers of a multi-layer neural network, the model generalizes more effectively and improves detection accuracy.
3. **Independent of cybersecurity experts and third-party services**: The model extracts the necessary URL features on its own, reducing the dependence on cybersecurity experts and third-party services (such as page ranking or domain age).

### Model Structure

The AntiPhishStack model is divided into two main stages:

- **First stage**: Train the features on base machine-learning classifiers and generate mean predictions through K-fold cross-validation. This stage supports the robustness and generalization ability of the model.
- **Second stage**: Use a two-layer stacked LSTM network, compiled dynamically with five adaptive optimizers, to obtain the best predictions on these features.
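The first stage described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `CentroidClassifier` is a hypothetical toy stand-in for the base machine-learning classifier, the feature vectors are made-up numbers rather than TF-IDF features, and averaging the per-fold predictions on a held-out set is one common reading of "mean prediction" in stacked generalization.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class CentroidClassifier:
    """Hypothetical toy base learner: scores a sample by its distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        """Return a pseudo-probability of class 1 (phishing) for each sample."""
        probs = []
        for x in X:
            d0 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[1]))
            probs.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        return probs


def stage_one_predictions(X_train, y_train, X_test, k=5):
    """Out-of-fold predictions for training rows, fold-averaged predictions for test rows."""
    oof = [0.0] * len(X_train)
    per_fold_test = []
    for held_out in k_fold_indices(len(X_train), k):
        held = set(held_out)
        fit_X = [x for i, x in enumerate(X_train) if i not in held]
        fit_y = [t for i, t in enumerate(y_train) if i not in held]
        model = CentroidClassifier().fit(fit_X, fit_y)
        # Each training row is scored only by a model that never saw it.
        held_probs = model.predict_proba([X_train[i] for i in held_out])
        for i, p in zip(held_out, held_probs):
            oof[i] = p
        per_fold_test.append(model.predict_proba(X_test))
    mean_test = [mean(col) for col in zip(*per_fold_test)]
    return oof, mean_test


# Made-up 2-D features; label 1 = phishing, 0 = benign (assumed encoding).
X_train = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.0, 0.3],
           [1.0, 0.7], [0.3, 0.0], [0.7, 1.0], [0.2, 0.2], [0.8, 0.8]]
y_train = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
X_test = [[0.15, 0.15], [0.85, 0.85]]

oof, mean_test = stage_one_predictions(X_train, y_train, X_test, k=5)
print(oof)
print(mean_test)
```

The out-of-fold values can serve as one meta-feature column for a downstream meta-classifier, because no prediction comes from a model that was trained on the row it scores.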
In addition, the predictions from the first and second stages are combined to train a meta-XGBoost classifier, which outputs the final prediction.

### Experimental Verification

The model was validated on two benchmark datasets containing URLs of benign and phishing websites. Results are reported on multiple evaluation metrics, including the AUC-ROC curve, precision, recall, F1-score, mean squared error (MSE), and accuracy. The model achieves an accuracy of 96.04%, which the paper reports as an improvement over existing studies.

### Summary

This research introduces the AntiPhishStack model as a solution to ever-evolving phishing threats. Beyond improving phishing detection accuracy, the work contributes to the discussion of symmetry and asymmetry in information security and offers a reference for future cybersecurity research.
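The evaluation metrics named under Experimental Verification can be computed from true labels and predicted probabilities as in the sketch below. It is an illustration only: the labels and scores are made-up toy values, not results from the paper, and the 0.5 decision threshold, the label encoding (1 = phishing), and the pairwise formulation of AUC-ROC are assumptions.

```python
def evaluate(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1, MSE, and AUC-ROC for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # MSE between the true labels and the predicted probabilities.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

    # AUC-ROC as the fraction of (positive, negative) pairs ranked correctly;
    # tied scores count as half a correct pair.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    auc_roc = wins / (len(pos) * len(neg)) if pos and neg else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mse": mse, "auc_roc": auc_roc}


# Toy labels and scores for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

for name, value in evaluate(y_true, y_prob).items():
    print(f"{name}: {value:.4f}")
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for a small check like this one; a sort-based version would be preferable on benchmark-sized datasets.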