SEDAC: A CVAE-Based Data Augmentation Method for Security Bug Report Identification

Y. Liao, T. Zhang
2024-01-22
Abstract: Bug tracking systems store many bug reports, some of which are related to security. Identifying these security bug reports (SBRs) may help us predict some security-related bugs and solve security issues promptly so that the project can avoid threats and attacks. However, in the real world, the proportion of security bug reports is very low; thus, directly training a prediction model on raw data may give inaccurate results. Faced with this challenge of data imbalance, many researchers have used text filtering or clustering methods to reduce the proportion of non-security bug reports (NSBRs), or applied oversampling methods to synthesize SBRs, to make the dataset as balanced as possible. Nevertheless, two challenges remain for those methods: 1) They ignore long-distance contextual information. 2) They fail to generate a fully balanced dataset. To tackle these two challenges, we propose SEDAC, a new SBR identification method that generates similar bug report vectors to solve the data imbalance problem and accurately detect security bug reports. Unlike previous studies, which are based on word2vec, it first converts bug reports into individual bug report vectors with distilBERT. Then, it trains a generative model, a conditional variational auto-encoder (CVAE), to generate similar vectors with security labels, which makes the number of SBRs equal to the number of NSBRs. Finally, the balanced data are used to train a security bug report classifier. To evaluate the effectiveness of our framework, we apply it to 45,940 bug reports from Chromium and four Apache projects. The experimental results show that SEDAC outperforms all the baselines in g-measure with improvements of around 14.24%-50.10%.
Cryptography and Security, Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is the identification of security bug reports (SBRs) when the data are heavily imbalanced. Specifically:

1. **Data Imbalance Problem**: In practice, security bug reports (SBRs) are far fewer than non-security bug reports (NSBRs). Training a prediction model directly on such data leads to inaccurate results. Earlier approaches either reduce the proportion of NSBRs through text filtering or clustering, or synthesize SBRs through oversampling, but neither produces a completely balanced dataset.
2. **Loss of Long-distance Contextual Information**: Previous studies mainly used tf-idf and word2vec to represent bug reports. Both representations ignore long-distance dependencies between words, which results in information loss.

To address these two challenges, the paper proposes SEDAC (Security bug report identification framework composed of DistilBERT And Conditional variational autoencoder), a data augmentation method based on a conditional variational autoencoder (CVAE) for identifying security bug reports. The workflow of SEDAC is as follows:

- **Text Representation Stage**: Use DistilBERT to convert bug reports into vector representations that capture richer text information and long-distance contextual relationships.
- **Model Training Stage**: Train a CVAE as the generative model, so that it can generate similar vectors conditioned on a given security label.
- **Data Synthesis Stage**: Use the trained CVAE decoder to generate new SBR vectors until the number of SBRs equals the number of NSBRs, making the dataset completely balanced.
- **Classification Stage**: Use the balanced dataset to train a logistic regression classifier that identifies security bug reports.
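
The data synthesis step and the evaluation metric can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `balance_with_generator`, `toy_generator`, and `g_measure` are hypothetical names, the toy generator merely stands in for a trained CVAE decoder, and the g-measure is assumed to be the harmonic mean of recall (pd) and 1 - pf, the definition commonly used in SBR prediction studies.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def balance_with_generator(
    sbr_vectors: Sequence[Vector],
    nsbr_vectors: Sequence[Vector],
    generate_sbr: Callable[[int], List[Vector]],
) -> Tuple[List[Vector], List[int]]:
    """Add synthetic SBR vectors until both classes have the same size.

    `generate_sbr(n)` is a hypothetical stand-in for sampling n vectors
    from a trained CVAE decoder conditioned on the "security" label.
    """
    missing = max(0, len(nsbr_vectors) - len(sbr_vectors))
    synthetic = generate_sbr(missing)
    features = list(sbr_vectors) + synthetic + list(nsbr_vectors)
    # Label 1 = SBR (real or synthetic), label 0 = NSBR.
    labels = [1] * (len(sbr_vectors) + len(synthetic)) + [0] * len(nsbr_vectors)
    return features, labels


def g_measure(tp: int, fn: int, fp: int, tn: int) -> float:
    """Harmonic mean of pd (recall) and 1 - pf (assumed definition)."""
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    denom = pd + (1.0 - pf)
    return 2.0 * pd * (1.0 - pf) / denom if denom else 0.0


def toy_generator(n: int, dim: int = 4, seed: int = 0) -> List[Vector]:
    """Placeholder generator: random Gaussian vectors, NOT a real CVAE."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


# Toy data: 3 SBR vectors against 10 NSBR vectors.
sbrs = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
nsbrs = [[0.0, 0.0, 0.0, 0.0] for _ in range(10)]
X, y = balance_with_generator(sbrs, nsbrs, toy_generator)
print(sum(y), len(y) - sum(y))  # 10 10
```

In a fuller pipeline, `X` and `y` would then be passed to a logistic regression classifier, and the confusion-matrix counts on a held-out test set would be fed to `g_measure`.
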
Experimental results show that SEDAC improves on existing methods in g-measure by around 14.24% to 50.10%, and that its pf (false positive rate) is very low, not exceeding 1% in any project.

### Formula Summary

- **Latent Variable Generation Formula of CVAE**:
  \[ z = m + e^{\sigma} \times u \]
  where \(m\) is the mean, \(\sigma\) is the logarithm of the standard deviation (so that \(e^{\sigma}\) is the standard deviation), and \(u\) is a noise vector randomly sampled from a standard Gaussian distribution.
- **KLD Loss Function**:
  \[ \text{KLDloss} = -\frac{1}{2}\sum\left(1 + 2\sigma - m^{2} - e^{2\sigma}\right) \]
  where the sum runs over the dimensions of the latent vector.
- **Mean Square Error (MSE) Loss Function**:
  \[ \text{MSEloss} = \frac{1}{768}\sum_{j = 1}^{768}\left(X_{j}^{(i)} - X_{\text{reconstruct},j}^{(i)}\right)^{2} \]
  where \(X^{(i)}\) is the \(i\)-th bug report vector, \(X_{\text{reconstruct}}^{(i)}\) is its reconstruction by the CVAE, and \(j\) indexes the 768 components of the vector.

The latent sampling formula and the two loss terms define how the CVAE is trained; the trained decoder is then used to synthesize the SBR vectors that balance the dataset.
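
The three formulas can be checked numerically with a small plain-Python sketch. It rests on one assumption: \(\sigma\) is read as the logarithm of the standard deviation, which is what the factor \(e^{\sigma}\) in the sampling formula implies, and the KL term uses the standard closed form for a diagonal Gaussian against a standard normal prior under that reading. The function names are illustrative, and how the two losses are weighted against each other is not specified here.

```python
import math
import random
from typing import List, Sequence


def sample_latent(m: Sequence[float], log_std: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterized sample z = m + exp(log_std) * u, with u ~ N(0, I)."""
    return [mi + math.exp(si) * rng.gauss(0.0, 1.0)
            for mi, si in zip(m, log_std)]


def kld_loss(m: Sequence[float], log_std: Sequence[float]) -> float:
    """KL divergence from N(m, exp(log_std)^2) to N(0, I), summed over dims."""
    return -0.5 * sum(1.0 + 2.0 * si - mi * mi - math.exp(2.0 * si)
                      for mi, si in zip(m, log_std))


def mse_loss(x: Sequence[float], x_reconstructed: Sequence[float]) -> float:
    """Mean squared error over vector components (768 in the summary above)."""
    if len(x) != len(x_reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_reconstructed)) / len(x)


rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, -1.0], rng)  # one 2-dimensional sample
prior_match = kld_loss([0.0, 0.0], [0.0, 0.0])   # posterior equals the prior
shifted = kld_loss([1.0], [0.0])                 # unit mean shift
```

When the encoder output matches the prior exactly (zero mean, unit standard deviation), the KL term vanishes, and shifting the mean by 1 in a single dimension costs 0.5, which is a quick sanity check on the signs in the KLD formula.
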