Abstract: Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging. For text domains that demand this level of expertise, pretraining on in-domain corpora has been a popular way for language models to acquire domain knowledge. However, cybersecurity texts often contain non-linguistic elements (NLEs), such as URLs and hash values, that may be unsuitable for established pretraining methodologies. Previous work in other domains has removed or filtered such text as noise, but the effectiveness of these methods has not been investigated, especially in the cybersecurity domain. We propose different pretraining methodologies and evaluate their effectiveness through downstream tasks and probing tasks. Our proposed strategy (selective MLM combined with jointly trained NLE token classification) outperforms the commonly taken approach of replacing NLEs. We use our domain-customized methodology to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity PLMs on most tasks.

What problem does this paper attempt to address?

This paper addresses the challenges that pretrained language models (PLMs) face when handling non-linguistic elements (NLEs) in the cybersecurity domain. Specifically:

1. **Complexity of cybersecurity texts**: Cybersecurity information is usually technically complex and presented as unstructured text, which makes automating cyber threat intelligence (CTI) difficult. Conventional pretraining methods may not handle these texts effectively.
2. **Processing of non-linguistic elements**: Cybersecurity texts often contain NLEs (such as URLs and hash values) that are poorly suited to existing self-supervised pretraining methods. For example, the masked language modeling (MLM) task works well for recovering natural language, but may be ineffective or even harmful when the model is asked to recover NLEs.
3. **Limitations of existing methods**: Previous studies usually simplify texts by replacing or filtering out NLEs, but the effectiveness of this approach has not been verified in the cybersecurity domain. These methods may also discard important information.
4. **Exploring more effective pretraining strategies**: The paper proposes and tests multiple strategies that make better use of NLEs during pretraining, with the goal of improving performance on downstream tasks. In particular, the authors test selective masking and joint training of NLE token classification, and find that these methods outperform the simple replacement strategy.

### Main Contributions

- Proposed multiple pretraining strategies for handling NLEs and verified their effectiveness experimentally.
- Identified a strategy that combines selective masking with NLE token classification and performs well on both downstream tasks and probing tasks.
- Used this strategy to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity PLMs on most tasks.
- Made the CyBERTuned model weights, training resources, and code public for the research community.

Through these contributions, the paper offers a more effective approach to natural language processing in the cybersecurity domain, especially for complex texts that mix natural language with technical information.
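The contrast between replacing NLEs and selectively masking them can be illustrated with a small sketch. This is not the authors' implementation: the regular expressions, the whitespace tokenization, the label names (`URL`, `HASH`, `O`), and the functions `nle_label`, `selective_mask`, and `replace_nles` are all hypothetical, and the sketch assumes every detected NLE is excluded from masking, which the summary above does not specify.

```python
import random
import re

# Hypothetical NLE detectors; the paper's actual detection rules are not given here.
NLE_PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    # Hex strings with MD5, SHA-1, or SHA-256 lengths.
    "HASH": re.compile(r"[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64}"),
}


def nle_label(token):
    """Return the NLE type of a token, or "O" for ordinary language."""
    for label, pattern in NLE_PATTERNS.items():
        if pattern.fullmatch(token):
            return label
    return "O"


def selective_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask only non-NLE tokens and emit NLE labels for a joint classifier.

    Returns (masked_tokens, mlm_targets, nle_labels). mlm_targets holds the
    original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    labels = [nle_label(t) for t in tokens]
    masked, targets = [], []
    for token, label in zip(tokens, labels):
        if label == "O" and rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(token)
        else:
            # NLEs stay visible as context but are never MLM targets.
            masked.append(token)
            targets.append(None)
    return masked, targets, labels


def replace_nles(tokens):
    """Baseline: replace each NLE with a placeholder naming its type."""
    return [t if nle_label(t) == "O" else f"[{nle_label(t)}]" for t in tokens]


if __name__ == "__main__":
    text = ("The dropper at http://evil.example/a.exe has MD5 "
            "d41d8cd98f00b204e9800998ecf8427e")
    tokens = text.split()
    print(selective_mask(tokens, mask_prob=0.5))
    print(replace_nles(tokens))
```

In a real pretraining setup the same idea would operate on subword tokens, with the MLM loss and the NLE token classification loss computed from a shared encoder. The replacement baseline discards the URL and hash contents entirely, whereas selective masking keeps them in the input.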

Ignore Me But Don't Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain
