Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code

Gary A. McCully, John D. Hastings, Shengjie Xu
2024-12-03
Abstract: This study examines the impact of data snooping on neural networks for vulnerability detection in lifted code, building on previous research which used word2vec, and unidirectional and bidirectional transformer-based embeddings. The research specifically focuses on how model performance is affected when embedding models are trained on datasets that include samples also used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that models trained with GPT-2 embeddings consistently outperformed neural networks trained with other embeddings. The fact that this holds even when data snooping is introduced into the embedding model indicates GPT-2's robustness in representing complex code features, even under less-than-ideal conditions.
Cryptography and Security, Computation and Language, Machine Learning, Software Engineering
What problem does this paper attempt to address?
The paper examines how data snooping affects deep-learning models that detect vulnerabilities in compiled code. Specifically, it asks how model performance changes when the embedding model is trained on the same samples that are later used to train and validate the neural network. By deliberately introducing data-snooping conditions, the authors test whether this design flaw significantly changes model performance.

### Research Background

- **Security issues of closed-source software**: Many organizations rely on closed-source software such as Microsoft Windows and Adobe Acrobat Reader. Vulnerabilities in this software can affect system security on a global scale.
- **Detecting vulnerabilities in compiled code**: Because closed-source software ships without source code, researchers use machine-learning techniques to detect vulnerabilities in compiled code, especially stack-based buffer overflow (CWE-121) vulnerabilities.
- **The impact of data snooping**: Data snooping is a design flaw in which model training uses information that would not be available in a real deployment. It can inadvertently bias measured model performance, typically making results look better than they would be on truly unseen data.

### Research Objectives

The research evaluates the impact of data snooping on deep-learning models for detecting vulnerabilities in compiled code, with a focus on data snooping at the embedding-model layer. By introducing data-snooping conditions, it aims to reveal how robust the embedding models are under non-ideal data conditions.

### Main Findings

- **Robustness of the GPT-2 model**: Even when data snooping is introduced, neural networks trained with GPT-2 embeddings still perform well, indicating that GPT-2 is robust in representing complex code features.
- **Data snooping has a minor impact on model performance**: Introducing data snooping did not significantly change model performance. Either the effect of data snooping on these models is limited, or the samples randomly dropped as part of the methodology contained hidden features that are crucial for optimal performance.

### Conclusions

The performance of the different embedding models remained relatively stable after data-snooping conditions were introduced, especially with GPT-2 embeddings. This suggests that these models are reasonably robust to this form of data contamination and can maintain reliable performance under less-than-ideal conditions.

### Formula Examples

The paper reports model performance using metrics such as accuracy and F1-score, defined as follows:

- **Accuracy**:
  \[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} \]
  where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
- **F1-Score**:
  \[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
  where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

These definitions make it easier to compare model performance across conditions.
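
The accuracy and F1 definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the `ConfusionCounts` class and the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the paper specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary-classification outcome counts."""
    tp: int  # true positives: vulnerable samples flagged as vulnerable
    tn: int  # true negatives: safe samples flagged as safe
    fp: int  # false positives: safe samples flagged as vulnerable
    fn: int  # false negatives: vulnerable samples flagged as safe


def _safe_div(numerator: float, denominator: float) -> float:
    # Assumed convention: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def accuracy(c: ConfusionCounts) -> float:
    # (TP + TN) / (TP + FP + FN + TN)
    return _safe_div(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


def precision(c: ConfusionCounts) -> float:
    # TP / (TP + FP)
    return _safe_div(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # TP / (TP + FN)
    return _safe_div(c.tp, c.tp + c.fn)


def f1_score(c: ConfusionCounts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return _safe_div(2 * p * r, p + r)


if __name__ == "__main__":
    # Illustrative counts only; these are NOT results from the paper.
    counts = ConfusionCounts(tp=40, tn=45, fp=5, fn=10)
    print(f"accuracy={accuracy(counts):.4f}  f1={f1_score(counts):.4f}")
```

Computing both metrics from the same four counts makes it straightforward to compare a model trained with and without data snooping: evaluate each model on the same held-out test set, collect its counts, and compare the resulting scores side by side.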