Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?

Triet H. M. Le, M. Ali Babar
2024-07-15
Abstract: Background: Software Vulnerability (SV) assessment is increasingly adopted to address the ever-increasing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of the Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from the imbalanced distributions of the CVSS classes, but such data imbalance has been hardly understood and addressed in the literature. Aims: We conduct a large-scale study to quantify the impacts of data imbalance and mitigate the issue for SV assessment through the use of data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without leveraging the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient. We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.
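The abstract reports its gains in Matthews Correlation Coefficient (MCC), a metric that remains informative when classes are imbalanced. As a minimal sketch, the standard multiclass generalization of MCC can be computed from label counts alone. The function name `mcc` and the toy labels below are illustrative assumptions; the paper's own evaluation code is not shown here and may differ.

```python
from collections import Counter
from math import sqrt


def mcc(y_true, y_pred):
    """Multiclass Matthews Correlation Coefficient (illustrative sketch).

    Uses the standard count-based form:
        (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the sample count, c the number of correct predictions,
    t_k how often class k truly occurs, and p_k how often it is predicted.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)
    p_k = Counter(y_pred)
    classes = set(t_k) | set(p_k)

    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    spread_true = s * s - sum(v * v for v in t_k.values())
    spread_pred = s * s - sum(v * v for v in p_k.values())
    denominator = sqrt(spread_true * spread_pred)

    # Convention: return 0.0 when the denominator vanishes, e.g. when
    # every prediction falls into a single class.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical, heavily imbalanced severity labels.
    truth = ["LOW"] * 9 + ["HIGH"]
    majority_only = ["LOW"] * 10
    accuracy = sum(t == p for t, p in zip(truth, majority_only)) / len(truth)
    print("accuracy:", accuracy)
    print("MCC:", mcc(truth, majority_only))
```

On these toy labels, a model that always predicts the majority class reaches 90% accuracy but an MCC of 0, which illustrates why MCC is a reasonable yardstick for imbalanced CVSS classes.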
Software Engineering, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
The paper addresses data imbalance in software vulnerability (SV) assessment. When models automatically predict Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity, the classes of each metric are unevenly distributed. This imbalance has rarely been studied or addressed in the existing literature. Through a large-scale study, the paper quantifies the impact of data imbalance on SV assessment and evaluates how well data augmentation mitigates it.

### Main Research Content:

1. **Background and Motivation**:
   - Software vulnerability assessment is crucial in software lifecycle management for identifying and prioritizing potential security threats.
   - CVSS is the most commonly used industry standard, but manually assigning CVSS metrics is time-consuming and cumbersome.
   - Data imbalance is prevalent in CVSS metrics but has not been adequately studied.
2. **Research Objectives**:
   - Balance the class distributions of the CVSS metrics through data augmentation to improve the performance of SV assessment models.
   - Compare models trained with and without augmented data to evaluate the effectiveness of data augmentation.
3. **Methods**:
   - Collect data on 180,087 real-world software vulnerabilities and use nine data augmentation techniques to generate new descriptions.
   - Train models using common machine learning and deep learning techniques and evaluate their predictive performance on the different CVSS metrics.
4. **Results**:
   - Mitigating data imbalance significantly improves the predictive performance of the models on all CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient (MCC).
   - Simple text augmentation, such as combining random insertion, deletion, and replacement, outperforms the baseline across the board.
5. **Conclusion**:
   - The study provides the motivation and a first promising step toward tackling data imbalance, helping to improve the effectiveness of SV assessment.
   - The code and models have been made public to facilitate future research and application.

### Key Contributions:

- **Systematic Study**: The first systematic study of the importance and impact of data augmentation in mitigating data imbalance for SV assessment.
- **Benchmarking**: An empirical evaluation of the effectiveness of different data augmentation techniques that identifies the best combination of techniques.
- **Resource Sharing**: Shared code and models that let other researchers reproduce the results and explore further.

### Summary:

Through large-scale experiments, this paper demonstrates the potential of data augmentation for mitigating data imbalance in SV assessment and offers a reference point for improving the performance of assessment models. The findings contribute to academic research and also offer practical guidance for software vulnerability management and remediation.
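The simple text augmentation mentioned under Methods and Results (random insertion, deletion, and replacement applied to SV descriptions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice to draw inserted and replacement tokens from the same description, the number of edits per sample, and the policy of oversampling every minority class up to the majority count are all hypothetical.

```python
import random
from collections import Counter


def random_insert(tokens, rng):
    """Insert a copy of a randomly chosen existing token at a random position."""
    if not tokens:
        return list(tokens)
    out = list(tokens)
    out.insert(rng.randrange(len(out) + 1), rng.choice(tokens))
    return out


def random_delete(tokens, rng):
    """Delete one random token, but never empty the description."""
    if len(tokens) <= 1:
        return list(tokens)
    out = list(tokens)
    del out[rng.randrange(len(out))]
    return out


def random_replace(tokens, rng):
    """Overwrite one random position with a token drawn from the same text.

    Assumption: a real setup might draw replacements from a synonym list
    or a vocabulary instead.
    """
    if not tokens:
        return list(tokens)
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(tokens)
    return out


OPERATIONS = (random_insert, random_delete, random_replace)


def augment(text, rng, n_ops=2):
    """Apply n_ops randomly chosen edit operations to a whitespace-tokenized text."""
    tokens = text.split()
    for _ in range(n_ops):
        tokens = rng.choice(OPERATIONS)(tokens, rng)
    return " ".join(tokens)


def balance(texts, labels, seed=0):
    """Oversample each minority class with augmented copies until all
    classes match the majority class count. Originals are kept first."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    out_texts, out_labels = list(texts), list(labels)
    for label, pool in by_class.items():
        for _ in range(target - counts[label]):
            out_texts.append(augment(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels


if __name__ == "__main__":
    # Hypothetical descriptions and severity labels, for illustration only.
    descriptions = [
        "buffer overflow in parser allows remote code execution",
        "sql injection in search endpoint allows data disclosure",
        "use after free in renderer allows arbitrary code execution",
        "verbose error message reveals internal file path",
    ]
    severities = ["HIGH", "HIGH", "HIGH", "LOW"]
    new_texts, new_labels = balance(descriptions, severities)
    print(Counter(new_labels))
    for text in new_texts[len(descriptions):]:
        print(text)
```

In practice, augmentation of this kind is normally applied to the training split only, so that validation and test data keep their original class distribution.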