NG_MDERANK: A software vulnerability feature knowledge extraction method based on N‐gram similarity

Xiaoxue Wu, Shiyu Weng, Bin Zheng, Wei Zheng, Xiang Chen, Xiaobin Sun
DOI: https://doi.org/10.1002/smr.2727
2024-08-30
Journal of Software: Evolution and Process
Abstract: This paper proposes NG_MDERANK, a novel approach for extracting software vulnerability feature knowledge using N‐gram similarity; it enhances the detection and analysis of vulnerabilities by identifying the multiword phrases that are critical to security. NG_MDERANK can analyze samples efficiently and stably in environments with large sample sizes and complex samples, and it yields high‐value semi‐structured data. Based on the extraction results, the corresponding software vulnerability domain knowledge graph is constructed, which helps to study software security problems efficiently and to resolve vulnerabilities. As software grows in size and complexity, software vulnerabilities are increasing, leading to a range of serious security issues. Open‐source software vulnerability reports and documentation make analysis and detection far more convenient for researchers. However, the quality of the data sources varies, and the data are duplicated and lack correlation, so they often require a great deal of manual management and analysis. To address the scattered, heterogeneous, and poorly correlated data in traditional vulnerability repositories, this paper proposes a software vulnerability feature knowledge extraction method that combines the N‐gram model with mask similarity. The method extracts N‐gram candidate keywords, generates masked text data from them, and extracts vulnerability feature knowledge by calculating the similarity of the masked text. This method analyzes samples efficiently and stably in environments with large sample sizes and complex samples and obtains high‐value semi‐structured data. The final node, relationship, and attribute information is then obtained through secondary knowledge cleaning and extraction on the semi‐structured results. Based on the extraction results, the corresponding software vulnerability domain knowledge graph is constructed to explore in depth the semantic features and entity relationships of vulnerabilities, which helps to study software security problems efficiently and to resolve vulnerabilities. The effectiveness and superiority of the proposed method are verified by comparing it with several traditional keyword extraction algorithms on Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) vulnerability data.
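The extraction step described in the abstract (N‐gram candidates, masked text, similarity ranking) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each candidate is masked out of the description, that the masked text is compared with the original text, and that candidates whose removal lowers the similarity most are ranked highest. A bag-of-words cosine similarity stands in for whatever text representation the paper actually uses, and all names (`STOPWORDS`, `ngram_candidates`, `mask`, `rank_keyphrases`) are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or",
             "is", "are", "that", "can", "by", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def ngram_candidates(tokens, max_n=3):
    """Collect unique n-grams (n = 1..max_n) that do not start or end with a stopword."""
    seen = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            if gram not in seen:
                seen.append(gram)
    return seen


def mask(tokens, gram, mask_token="[MASK]"):
    """Replace every occurrence of the n-gram with mask tokens of equal length."""
    out, i, n = [], 0, len(gram)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == gram:
            out.extend([mask_token] * n)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in for an embedding model)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def rank_keyphrases(text, max_n=3, top_k=5):
    """Rank candidates by how much masking them lowers similarity to the original."""
    tokens = tokenize(text)
    scored = []
    for gram in ngram_candidates(tokens, max_n):
        sim = cosine(tokens, mask(tokens, gram))
        scored.append((" ".join(gram), sim))
    scored.sort(key=lambda pair: pair[1])  # lowest similarity first
    return scored[:top_k]


if __name__ == "__main__":
    description = ("A buffer overflow allows remote attackers to execute "
                   "arbitrary code. The buffer overflow is triggered by a "
                   "crafted packet.")
    for phrase, sim in rank_keyphrases(description):
        print(f"{sim:.3f}  {phrase}")
```

With this simple similarity measure, longer and more frequent phrases tend to rank first because masking them removes more tokens. A real system would likely need a stronger text representation and some form of length normalization.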
computer science, software engineering