Abstract: Background: Software Vulnerability (SV) assessment is increasingly adopted to address the ever-increasing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of the Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from the imbalanced distributions of the CVSS classes, but such data imbalance has been hardly understood and addressed in the literature. Aims: We conduct a large-scale study to quantify the impacts of data imbalance and mitigate the issue for SV assessment through the use of data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without leveraging the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient. We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.
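The abstract reports its gains in Matthews Correlation Coefficient (MCC) rather than accuracy. The sketch below illustrates why that choice matters under class imbalance. It is a simplified binary version (the CVSS tasks themselves are multi-class), and the 95/5 class split and label values are made-up illustrative data, not figures from the paper.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mcc_binary(y_true, y_pred):
    """Binary Matthews Correlation Coefficient for labels in {0, 1}.

    Returns 0.0 when the denominator is zero (for example, when the
    model predicts a single class), which is the usual convention.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced task: 95 samples of class 0, 5 of class 1.
y_true = [0] * 95 + [1] * 5

# A degenerate model that always predicts the majority class.
majority_pred = [0] * 100

majority_acc = accuracy(y_true, majority_pred)
majority_mcc = mcc_binary(y_true, majority_pred)
perfect_mcc = mcc_binary(y_true, y_true)

print(f"majority-class accuracy: {majority_acc:.2f}")
print(f"majority-class MCC:      {majority_mcc:.2f}")
print(f"perfect-model MCC:       {perfect_mcc:.2f}")
```

The majority-class predictor looks strong on accuracy while its MCC shows no correlation with the true labels, which is why a metric like MCC is a more informative yardstick when classes are skewed.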

What problem does this paper attempt to address?

The paper addresses data imbalance in software vulnerability (SV) assessment. When models automatically predict Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity, the classes of each metric are unevenly distributed. This imbalance has rarely been studied or addressed in the existing literature. Through a large-scale study, the paper quantifies the impact of data imbalance on SV assessment and evaluates how well data augmentation mitigates it.

### Main Research Content:

1. **Background and Motivation**:
   - Software vulnerability assessment is crucial in software lifecycle management for identifying and prioritizing potential security threats.
   - CVSS is the most commonly used industry standard, but manually assigning CVSS metrics is time-consuming and cumbersome.
   - Data imbalance is prevalent in CVSS metrics but has not been adequately studied.
2. **Research Objectives**:
   - Balance the class distributions of the CVSS metrics through data augmentation to improve the performance of SV assessment models.
   - Compare model performance with and without the augmented data to evaluate the effectiveness of data augmentation.
3. **Methods**:
   - Collect data on 180,087 real-world software vulnerabilities and use nine data augmentation techniques to generate new descriptions.
   - Train models using common machine learning and deep learning techniques and evaluate their predictive performance on the different CVSS metrics.
4. **Results**:
   - Mitigating data imbalance significantly improves predictive performance for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient (MCC).
   - Simple text augmentation, such as combining random text insertion, deletion, and replacement, outperforms the baseline across the board.
5. **Conclusion**:
   - The study provides the motivation and a first promising step toward tackling data imbalance for effective SV assessment.
   - The code and models have been made public to facilitate future research and application.

### Key Contributions:

- **Systematic Study**: The first systematic study of the importance and impact of data augmentation in mitigating data imbalance for software vulnerability assessment.
- **Benchmarking**: An empirical evaluation of the effectiveness of different data augmentation techniques that identifies the best combination of techniques.
- **Resource Sharing**: Shared code and models that let other researchers reproduce the results and explore further.

### Summary:

Through large-scale experiments, this paper demonstrates the potential of data augmentation to mitigate data imbalance in software vulnerability assessment and provides a reference point for improving assessment models. The findings contribute to academic research and offer guidance for vulnerability management and remediation in practice.
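The simple augmentation named in the results (random text insertion, deletion, and replacement) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `SYNONYMS` table, the function names, the deletion probability, and the sample descriptions and labels are all hypothetical, and the paper's nine techniques are not reproduced here. The idea is to oversample minority classes with perturbed copies of their descriptions until every class matches the largest one.

```python
import random

# Hypothetical toy synonym table; a real setup would use a lexical resource.
SYNONYMS = {
    "attacker": ["adversary"],
    "allows": ["permits", "enables"],
    "remote": ["network-based"],
    "crash": ["failure"],
}


def random_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens, n, rng):
    """Insert up to n synonyms of existing tokens at random positions."""
    out = list(tokens)
    pool = [t for t in tokens if t in SYNONYMS]
    for _ in range(n):
        if not pool:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(pool)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def augment(text, rng):
    """Combine replacement, insertion, and deletion on one description."""
    tokens = text.split()
    if not tokens:
        return text
    tokens = random_replacement(tokens, 1, rng)
    tokens = random_insertion(tokens, 1, rng)
    tokens = random_deletion(tokens, 0.1, rng)
    return " ".join(tokens)


def balance(samples, seed=0):
    """Oversample minority classes with augmented copies.

    samples is a list of (description, label) pairs. The originals are
    kept, and augmented copies are appended until every label has as
    many samples as the most frequent one.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(samples)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            balanced.append((augment(rng.choice(texts), rng), label))
    return balanced


# Made-up descriptions with hypothetical severity labels.
samples = [
    ("buffer overflow allows remote attacker to execute code", "HIGH"),
    ("sql injection allows attacker to read database", "HIGH"),
    ("use after free allows remote code execution", "HIGH"),
    ("malformed packet causes crash of local service", "LOW"),
]
balanced = balance(samples)
```

In a setup like this, augmentation would normally be applied to the training split only, so that evaluation still reflects the original class distribution.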

Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
