Abstract: The growing number of phishing websites poses a significant threat because such sites are hard to detect. They rely on internet users mistaking them for genuine sites and unknowingly revealing private information such as login IDs, passwords, and credit card numbers. This paper proposes a new approach to the anti-phishing problem. The approach uses three groups of features: URL character sequences extracted without prior knowledge of phishing, various kinds of hyperlink information, and the textual content of the webpage. These features are combined and used to train an XGBoost classifier. A major contribution of this paper is the selection of new features that can detect zero-hour attacks and do not depend on any third-party services. In particular, we extract character-level Term Frequency-Inverse Document Frequency (TF-IDF) features from the noisy parts of the HTML and from the plaintext of the given webpage. Moreover, our proposed hyperlink features capture the relationship between the content and the URL of a webpage. Because no large phishing data set is publicly available, we created our own data set of 60,252 webpages to validate the proposed solution. It contains 32,972 benign webpages and 27,280 phishing webpages. In the evaluation, we measure the performance of each category of the proposed feature set and compare various classification algorithms. The empirical results show that each individual feature category is valuable for phishing detection, and that integrating all the features further improves detection accuracy. The proposed approach achieved an accuracy of 96.76% with a false-positive rate of only 1.39% on our data set, and an accuracy of 98.48% with a false-positive rate of 2.09% on a benchmark data set, outperforming the existing baseline approaches.
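To make the character-level TF-IDF idea concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the n-gram range (1 to 3), the smoothed IDF formula, the L2 normalisation, the function names, and the toy URLs are all assumptions chosen for illustration, since the abstract does not specify them. The same functions could be applied to webpage plaintext or HTML instead of URLs.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max in text."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def fit_idf(documents, n_min=1, n_max=3):
    """Compute a smoothed inverse document frequency for each n-gram."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each n-gram at most once per document.
        doc_freq.update(set(char_ngrams(doc, n_min, n_max)))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
    return {
        gram: math.log((1 + n_docs) / (1 + df)) + 1.0
        for gram, df in doc_freq.items()
    }


def tfidf_vector(text, idf, n_min=1, n_max=3):
    """Map text to an L2-normalised sparse TF-IDF vector (a dict)."""
    counts = Counter(g for g in char_ngrams(text, n_min, n_max) if g in idf)
    weights = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


# Hypothetical toy URLs, for illustration only.
urls = [
    "http://example.com/login",
    "http://example.com/about",
    "http://examp1e-secure-login.test/verify",
]
idf = fit_idf(urls)
vectors = [tfidf_vector(u, idf) for u in urls]
```

In this sketch, n-grams that occur in few documents (such as the digit `1` in the third URL) receive a higher IDF weight than n-grams shared by every document. In a full pipeline, vectors like these would be concatenated with the hyperlink features and passed to a classifier such as XGBoost, a step omitted here.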
