Abstract: The expansion of the Internet has led to the widespread proliferation of malicious URLs, which have become a primary vector for cyber threats. Detecting malicious URLs is now essential for improving network security. The technological revolution spurred by pre-trained language models holds great promise for advancing malicious URL detection. However, current research applying these models to URLs overlooks several crucial factors: it lacks domain-specific adaptation, omits character-level information, and neglects both local detail extraction and low-order encoding information. In this paper, we propose PMANet, a pre-trained language model-guided multi-level feature attention network, to address these issues. To ease the transition of the pre-trained Transformer into the URL domain and to enable it to capture information at both the subword and character levels, we propose a post-training program that continues training the model on URLs with three self-supervised learning objectives: masked language modeling, noisy language modeling, and domain discrimination. We then develop a module that captures the output of each encoding layer, extracting hierarchical representations of URLs that span from low-level to high-level features. In addition, we propose a layer-wise attention mechanism that dynamically assigns weight coefficients to these feature layers based on their relevance. Finally, we apply spatial pyramid pooling to perform multi-scale down-sampling, obtaining both local features and global context. PMANet thus integrates URL feature extraction along three axes: it captures information at both the lexical and character levels, extracts features from low to high order, and discerns patterns at both global and local scales.
We evaluate PMANet in challenging real-world scenarios, including small-scale data, class imbalance, cross-dataset generalization, and adversarial attacks, as well as in case studies on active malicious URLs. Across all experiments, PMANet outperforms both the previous state-of-the-art pre-trained models and conventional deep learning models. Specifically, PMANet still achieves an AUC of 0.9941 under adversarial attacks and correctly identifies all 20 active malicious URLs in the case study. The code and data for our research are available at: https://github.com/Alixyvtte/Malicious-URL-Detection-PMANet.
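
To make the layer-wise attention and spatial pyramid pooling steps more concrete, the following is a minimal pure-Python sketch and not the released PMANet implementation. It assumes each encoder layer yields a (sequence length x hidden size) matrix, that the per-layer relevance scores are already given (in the paper they are assigned dynamically by the model), and that pooling is max pooling over 1-D bins along the sequence axis. The names `layer_attention`, `pyramid_pool`, and the `levels` parameter are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[Vector]  # shape: (seq_len, hidden)


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_attention(layers: List[Matrix], layer_scores: Sequence[float]) -> Matrix:
    """Fuse per-layer outputs into one matrix using softmax-normalised weights.

    Assumption: `layer_scores` holds one relevance score per layer; here they
    are plain inputs rather than learned parameters.
    """
    weights = softmax(layer_scores)
    seq_len, hidden = len(layers[0]), len(layers[0][0])
    fused = [[0.0] * hidden for _ in range(seq_len)]
    for weight, layer in zip(weights, layers):
        for t in range(seq_len):
            for h in range(hidden):
                fused[t][h] += weight * layer[t][h]
    return fused


def pyramid_pool(features: Matrix, levels: Sequence[int] = (1, 2, 4)) -> Vector:
    """1-D pyramid max pooling along the sequence axis.

    For each level, the sequence is split into that many bins and each bin is
    max-pooled per hidden dimension. One bin gives global context; more bins
    give progressively more local detail. Output length is
    hidden * sum(levels), regardless of the sequence length.
    """
    seq_len, hidden = len(features), len(features[0])
    pooled: Vector = []
    for bins in levels:
        for b in range(bins):
            start = (b * seq_len) // bins
            end = max(start + 1, math.ceil((b + 1) * seq_len / bins))
            end = min(end, seq_len)
            window = features[start:end]
            pooled.extend(max(row[h] for row in window) for h in range(hidden))
    return pooled
```

With two toy layers and equal scores, `layer_attention` reduces to a plain average of the layers, and `pyramid_pool(fused, levels=(1, 2))` concatenates one global max with two half-sequence maxima. The fixed output length is what lets a classifier head consume URLs of varying length.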
