A New Dataset and Methodology for Malicious URL Classification

Ilan Schvartzman, Roei Sarussi, Maor Ashkenazi, Ido Kringel, Yaniv Tocker, Tal Furman Shohet
2024-12-31
Abstract: Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce DeepURLBench, a novel multi-class dataset for malicious URL classification that distinguishes between benign, phishing, and malicious URLs. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers and apply these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses two major challenges in the field of malicious URL classification:

1. **Lack of comprehensive open-source datasets**: Existing malicious URL classification datasets are insufficient in scale, lack diversity, have poor time relevance, and do not include valuable DNS response data. These shortcomings impede the development of effective, general-purpose malicious URL classification models.
2. **Limitations of existing models**: Current classification models either lack real-time processing capabilities or perform below expectations. For example, some deep-learning-based models achieve good accuracy but cannot classify in real time, while models that can classify in real time (such as URLNet) have relatively low accuracy.

To address these problems, the authors propose the following measures:

- **A new multi-class dataset, DeepURLBench**: The dataset distinguishes between benign, phishing, and malicious URLs, and has been strictly cleaned and structured, providing a better option than existing datasets.
- **A method that combines DNS features with expert-constructed features**: Integrating these non-linguistic features enhances the model's ability to classify URLs while maintaining real-time performance.
- **A time-based benchmarking method**: The test set is split on a monthly basis, and the model's performance is evaluated in each time period to account for the data distribution changing over time.
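The time-based benchmarking idea can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the record keys `url`, `label`, and `date`, the `predict` callable, and the use of plain accuracy as the metric are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def split_by_month(samples):
    """Group samples into buckets keyed by (year, month), in chronological order.

    Each sample is assumed (hypothetically) to be a dict with a `date` field.
    """
    buckets = defaultdict(list)
    for sample in samples:
        key = (sample["date"].year, sample["date"].month)
        buckets[key].append(sample)
    return dict(sorted(buckets.items()))


def monthly_accuracy(samples, predict):
    """Evaluate a classifier separately on each monthly slice of the test set.

    `predict` is a hypothetical callable mapping a URL string to a label.
    Accuracy is used here only as a stand-in metric.
    """
    results = {}
    for month, group in split_by_month(samples).items():
        correct = sum(predict(s["url"]) == s["label"] for s in group)
        results[month] = correct / len(group)
    return results


if __name__ == "__main__":
    # Toy test set with made-up URLs and labels.
    test_set = [
        {"url": "http://a.example", "label": "benign", "date": date(2024, 1, 5)},
        {"url": "http://b.example", "label": "phishing", "date": date(2024, 1, 20)},
        {"url": "http://c.example", "label": "benign", "date": date(2024, 2, 3)},
    ]
    # A trivial baseline that labels every URL as benign.
    for month, acc in monthly_accuracy(test_set, lambda url: "benign").items():
        print(month, acc)
```

Reporting one score per month, rather than a single aggregate, makes it visible when a model's performance degrades on later slices as the data distribution drifts.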
### Formula Representation

The following are examples of formulas used in the paper:

- **Entropy calculation**:
  \[ H = -\sum_{c \in C} P(c) \log_n P(c) \]
  where \(C\) is the set of characters and \(P(c)\) is the probability of occurrence of character \(c\).
- **Multi-class loss function**:
  \[ L = L_1 + L_2 \]
  where \(L_1\) and \(L_2\) are binary cross-entropy losses: \(L_1\) distinguishes benign from malicious URLs, and \(L_2\) distinguishes the specific type of malicious behavior (phishing or malware).

Through these improvements, the paper aims to improve both the accuracy and the real-time performance of malicious URL classification, so as to better cope with ever-evolving cyber threats.
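The two formulas can be sketched in plain Python as follows. This is an illustrative sketch, not the paper's implementation. The logarithm base \(n\) is not defined in this summary, so it is exposed as a `base` parameter with an assumed default of 2. The function names, the probability clipping constant `eps`, and the choice to always evaluate both loss terms are likewise assumptions.

```python
import math
from collections import Counter


def char_entropy(text, base=2.0):
    """Character-level entropy H = -sum_c P(c) * log_base P(c).

    P(c) is estimated as the relative frequency of character c in `text`.
    The default base of 2 is an assumption; the source only writes log_n.
    """
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((k / total) * math.log(k / total, base) for k in counts.values())


def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and a 0/1 target y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def combined_loss(p_malicious, y_malicious, p_phishing, y_phishing):
    """L = L_1 + L_2 for a single sample.

    L_1: benign (0) vs. malicious (1).
    L_2: malware (0) vs. phishing (1); this 0/1 encoding is an assumption.
    """
    l1 = binary_cross_entropy(p_malicious, y_malicious)
    l2 = binary_cross_entropy(p_phishing, y_phishing)
    return l1 + l2


if __name__ == "__main__":
    print(char_entropy("aabb"))               # two equally likely characters
    print(combined_loss(0.9, 1, 0.8, 1))      # confident, correct predictions
```

How \(L_2\) should treat benign samples, which have no malicious subtype, is not specified in this summary; a real training loop would need to decide whether to mask that term for such samples.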