Abstract: Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce a novel, multi-class dataset for malicious URL classification, distinguishing between benign, phishing and malicious URLs, named DeepURLBench. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models, as compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers, applying these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.

What problem does this paper attempt to address?

This paper addresses two major challenges in malicious URL classification:

1. **Lack of comprehensive open-source datasets**: Existing malicious URL classification datasets are too small, lack diversity, have poor time relevance, and do not include valuable DNS response data. These shortcomings impede the development of effective, general-purpose malicious URL classification models.
2. **Limitations of existing models**: Current classification models either lack real-time processing capabilities or perform less than ideally. For example, some deep-learning-based models achieve good accuracy but cannot classify in real time, while models that can classify in real time (such as URLNet) have relatively low accuracy.

To solve these problems, the authors propose the following measures:

- **Introduce a new multi-class dataset, DeepURLBench**: The dataset distinguishes between benign, phishing, and malicious URLs, and has been strictly cleaned and structured, providing a better option than existing datasets.
- **Propose a method that combines DNS features and expert-constructed features**: Integrating these non-linguistic features enhances the model's ability to classify URLs while maintaining real-time performance.
- **Establish a time-based benchmarking method**: The test set is split on a monthly basis, and the model's performance is evaluated in each time period to account for the data distribution changing over time.
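The time-based benchmarking idea can be illustrated with a minimal sketch. The function names, the record layout `(url, label, first_seen)`, and the sample data below are assumptions made for illustration; they are not taken from the paper or from DeepURLBench.

```python
from collections import defaultdict
from datetime import date


def split_by_month(records):
    """Group (url, label, first_seen) tuples into chronological monthly buckets."""
    buckets = defaultdict(list)
    for url, label, first_seen in records:
        buckets[(first_seen.year, first_seen.month)].append((url, label))
    # Sort the (year, month) keys so the buckets are in time order.
    return dict(sorted(buckets.items()))


def monthly_accuracy(buckets, predict):
    """Score a URL -> label callable separately on each monthly bucket."""
    return {
        month: sum(predict(url) == label for url, label in rows) / len(rows)
        for month, rows in buckets.items()
    }


# Hypothetical test records; URLs, labels, and dates are made up for illustration.
records = [
    ("http://example.com/a", "benign", date(2023, 1, 5)),
    ("http://example.org/login", "phishing", date(2023, 2, 9)),
    ("http://example.net/b", "benign", date(2023, 1, 20)),
]
buckets = split_by_month(records)
# A trivial stand-in classifier that labels every URL as benign.
scores = monthly_accuracy(buckets, lambda url: "benign")
```

Reporting one score per month, rather than a single aggregate, makes it visible when a model's accuracy degrades on URLs collected further from the training period.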
### Formula Representation

The paper involves the following formulas:

- **Entropy calculation formula**:

  \[ H = -\sum_{c \in C} P(c) \log_n P(c) \]

  where \(C\) is the set of characters and \(P(c)\) is the probability of occurrence of character \(c\).

- **Multi-class loss function**:

  \[ L = L_1 + L_2 \]

  where \(L_1\) and \(L_2\) are two binary cross-entropy losses: \(L_1\) distinguishes benign from malicious URLs, and \(L_2\) distinguishes the specific type of malicious behavior (phishing or malware).

Through these improvements, the paper aims to improve both the accuracy and the real-time performance of malicious URL classification, so as to better cope with ever-evolving cyber threats.
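As a worked illustration of the two formulas, the sketch below computes the character entropy of a string and the sum of two binary cross-entropy terms. It rests on several assumptions: the logarithm base \(n\) is not specified in the summary, so it is exposed as a `base` parameter (defaulting to 2); \(P(c)\) is estimated from character frequencies within the string; and the function names are hypothetical. Whether the paper masks \(L_2\) for benign samples is also not stated, so the two terms are simply added.

```python
import math
from collections import Counter


def char_entropy(text, base=2):
    """H = -sum_c P(c) * log_base P(c), with P(c) estimated from text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log(count / total, base)
        for count in Counter(text).values()
    )


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def two_term_loss(p_malicious, y_malicious, p_phishing, y_phishing):
    """L = L1 + L2: benign-vs-malicious term plus phishing-vs-malware term."""
    return bce(p_malicious, y_malicious) + bce(p_phishing, y_phishing)


# Two equally frequent characters give exactly one bit of entropy.
h = char_entropy("aabb")
# Both heads maximally uncertain: each term equals ln 2.
loss = two_term_loss(0.5, 1, 0.5, 0)
```

A string with many distinct, evenly distributed characters (as in randomly generated domains) yields a higher entropy than a string dominated by a few characters, which is why entropy is a common hand-crafted URL feature.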

A New Dataset and Methodology for Malicious URL Classification

Robust Detection of Malicious URLs with Self-Paced Wide & Deep Learning

Towards Fighting Cybercrime: Malicious URL Attack Type Detection using Multiclass Classification

An intelligent identification and classification system for malicious uniform resource locators (URLs)

Malicious URL Detection using Machine Learning: A Survey

An Assessment of Lexical, Network, and Content-Based Features for Detecting Malicious URLs Using Machine Learning and Deep Learning Models

DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Cascaded capsule twin attentional dilated convolutional network for malicious URL detection

An ensemble classification method based on machine learning models for malicious Uniform Resource Locators (URL)

Malicious URL Detection via Pretrained Language Model Guided Multi-Level Feature Attention Network

Malicious URL Detection Based on Improved Multilayer Recurrent Convolutional Neural Network Model

Hybrid Machine Learning Approach For Real-Time Malicious Url Detection Using Som-Rmo And Rbfn With Tabu Search Optimization

Detection of Malicious Websites Using Machine Learning Techniques

Malicious URL Detection Based on Associative Classification

URL and Malicious Link Prediction

Malware Analysis Using Machine Learning and Deep Learning Techniques

An Efficient DenseNet-Based Deep Learning Model for Malware Detection

Novel Security Metrics for Identifying Risky Unified Resource Locators (URLs)

Efficient Classification of Malicious URLs: M-BERT—A Modified BERT Variant for Enhanced Semantic Understanding

Machine Learning System for Malicious Website Detection: A Literature Review

Advancing Malicious Website Identification: A Machine Learning Approach Using Granular Feature Analysis