Abstract: The widespread accessibility of the Internet has led to a surge in online fraudulent activities, underscoring the necessity of shielding users' sensitive information from cybercriminals. Phishing, a well-known cyberattack, revolves around the creation of phishing webpages and the dissemination of corresponding URLs, aiming to deceive users into sharing their sensitive information, often for identity theft or financial gain. Various techniques are available for preemptively categorizing zero-day phishing URLs by distilling unique attributes and constructing predictive models. However, these existing techniques encounter unresolved issues. This proposal delves into persistent challenges within phishing detection solutions, particularly concentrated on the preliminary phase of assembling comprehensive datasets, and proposes a potential solution in the form of a tool engineered to alleviate bias in ML models. Such a tool can generate phishing webpages for any given set of legitimate URLs, infusing randomly selected content and visual-based phishing features. Furthermore, we contend that the tool holds the potential to assess the efficacy of existing phishing detection solutions, especially those trained on confined datasets.

What problem does this paper attempt to address?

The paper addresses the bias that existing machine learning (ML) models exhibit when detecting phishing web pages, a bias that originates in the initial stage of dataset construction. It focuses on two main issues:

1. **Imbalanced Dataset**: In existing datasets used to train ML models, the numbers of legitimate URLs and phishing URLs are often disproportionate, which biases the model during classification. The model becomes more likely to predict the majority class (usually legitimate URLs), which reduces its recognition accuracy for new (zero-day) phishing attacks.
2. **Lack of Diversity**: Many researchers collect phishing samples from a single repository, which limits the diversity of the dataset and may leave the model unable to recognize different types of phishing features. Improving detection accuracy requires a dataset that covers multiple kinds of phishing features, including URL features and web page content features (such as logos, icons, HTML, CSS, and JS/PHP code).

To solve these problems, the paper proposes a tool that generates corresponding phishing web pages from a given set of legitimate URLs and randomly adds content and visual-based phishing features. A dataset generated this way can be balanced between legitimate and phishing web pages and has higher diversity. The tool can also be used to evaluate the effectiveness of existing phishing detection solutions, especially models trained on limited datasets.

### Specific Problem Description

- **Imbalanced Dataset**: When trained on an imbalanced dataset, the ML model becomes biased towards the majority class (usually legitimate URLs), which reduces its classification accuracy for the minority class (phishing URLs).

  \[ \text{Imbalanced Dataset} \to \text{Bias in Model} \to \text{Reduced Accuracy for Minority Class (Phishing URLs)} \]

- **Lack of Diversity**: Phishing samples obtained from a single repository may not cover all types of phishing features, so the model performs poorly when it encounters new types of phishing attacks.

### Solution Overview

The tool proposed in the paper generates phishing web pages through the following steps:

1. **Input**: A legitimate URL and a set of phishing features ($k$).
2. **Processing**:
   - Visit the legitimate URL and download its source code.
   - Randomly select and add diverse content and visual-based phishing features to the web page.
3. **Output**: A phishing web page with $k$ phishing features embedded.

In this way, the generated dataset maintains the balance between legitimate and phishing web pages and is more diverse, which is intended to improve the accuracy of ML models in detecting new phishing attacks.

### Conclusion and Future Outlook

This research examines the ongoing challenges in phishing detection solutions, especially in the basic stage of dataset construction. The proposed tool alleviates dataset-related problems by generating phishing web pages, providing new ideas and methods for future phishing detection research. The tool can also be used to evaluate the effectiveness of existing phishing detection solutions, especially models trained on limited datasets.

Mitigating Bias in Machine Learning Models for Phishing Webpage Detection
