Ignore Me But Don't Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain

Eugene Jang, Jian Cui, Dayeon Yim, Youngjin Jin, Jin-Woo Chung, Seungwon Shin, Yongjae Lee
2024-04-02
Abstract: Cybersecurity information is often technically complex and relayed through unstructured text, making automation of cyber threat intelligence highly challenging. For such text domains that involve high levels of expertise, pretraining on in-domain corpora has been a popular method for language models to obtain domain expertise. However, cybersecurity texts often contain non-linguistic elements (such as URLs and hash values) that could be unsuitable for the established pretraining methodologies. Previous work in other domains has removed or filtered such text as noise, but the effectiveness of these methods has not been investigated, especially in the cybersecurity domain. We propose different pretraining methodologies and evaluate their effectiveness through downstream tasks and probing tasks. Our proposed strategy (selective MLM and jointly training NLE token classification) outperforms the commonly taken approach of replacing non-linguistic elements (NLEs). We use our domain-customized methodology to train CyBERTuned, a cybersecurity domain language model that outperforms other cybersecurity PLMs on most tasks.
Cryptography and Security, Computation and Language, Machine Learning
### What problems does this paper attempt to solve?

This paper addresses the challenges that pretrained language models (PLMs) face when handling non-linguistic elements (NLEs) in cybersecurity text. Specifically:

1. **Complexity of cybersecurity texts**: Cybersecurity information is usually technically complex and presented as unstructured text, which makes automating cyber threat intelligence (CTI) very difficult. Established pretraining methods may not handle such text effectively.
2. **Processing of non-linguistic elements**: Cybersecurity texts often contain NLEs (such as URLs and hash values) that may be unsuitable for existing self-supervised pretraining methods. For example, the masked language modeling (MLM) task is effective for restoring natural language, but may be ineffective or even harmful when the model is asked to restore NLEs.
3. **Limitations of existing methods**: Previous studies usually simplify text by replacing or filtering NLEs, but the effectiveness of this approach has not been verified in the cybersecurity domain. These methods may also discard important information.
4. **Exploring more effective pretraining strategies**: The paper proposes and tests multiple strategies for making better use of NLEs during pretraining, with the goal of improving performance on downstream tasks. The authors try selective masking and joint training of NLE token classification, and find that the combination outperforms the simple replacement strategy.

### Main Contributions

- Proposes multiple pretraining strategies for handling NLEs and evaluates them experimentally.
- Identifies a strategy that combines selective masking with NLE token classification and performs well on both downstream tasks and probing tasks.
- Uses this strategy to train CyBERTuned, a cybersecurity-domain language model that outperforms other cybersecurity PLMs on most tasks.
- Makes the CyBERTuned model weights, training resources, and code public for the research community.

Through these efforts, the paper provides a more effective pretraining method for natural language processing in the cybersecurity domain, particularly for complex, technical text.
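
The selective masking and joint NLE classification ideas described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each token already carries an NLE label, that "selective" means tokens with certain labels are never chosen as MLM targets, and that the two training losses are combined as a weighted sum. The names `selective_mask`, `joint_loss`, `skip_labels`, and `nle_weight`, and the label strings `"URL"`, `"HASH"`, and `"O"`, are hypothetical and chosen only for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol; real tokenizers define their own


def selective_mask(tokens, nle_labels, mask_prob=0.15,
                   skip_labels=frozenset({"URL", "HASH"}), seed=None):
    """Mask tokens for MLM, but never those whose NLE label is in skip_labels.

    Returns (masked_tokens, mlm_targets). mlm_targets[i] holds the original
    token at masked positions and None everywhere else.
    Assumption: which NLE types to skip is a design choice, not taken
    from the paper.
    """
    if len(tokens) != len(nle_labels):
        raise ValueError("tokens and nle_labels must have the same length")

    rng = random.Random(seed)
    masked = list(tokens)
    targets = [None] * len(tokens)

    for i, (token, label) in enumerate(zip(tokens, nle_labels)):
        if label in skip_labels:
            continue  # leave the NLE token visible and unpredicted
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            targets[i] = token
    return masked, targets


def joint_loss(mlm_loss, nle_loss, nle_weight=1.0):
    """Combine the MLM loss and the NLE token-classification loss.

    Assumption: a simple weighted sum; the actual weighting is not
    specified in this summary.
    """
    return mlm_loss + nle_weight * nle_loss
```

In this sketch the NLE tokens stay in the input, so the model still sees them as context and the classification objective can still label them. They are simply excluded from the set of positions the model must reconstruct.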