Abstract: This study examines the impact of data snooping on neural networks for vulnerability detection in lifted code, building on previous research that used word2vec and unidirectional and bidirectional transformer-based embeddings. The research focuses on how model performance is affected when embedding models are trained on datasets that include samples also used for neural network training and validation. The results show that introducing data snooping did not significantly alter model performance, suggesting either that data snooping had a minimal impact or that samples randomly dropped as part of the methodology contained hidden features critical to achieving optimal performance. In addition, the findings reinforce the conclusions of previous research, which found that neural networks trained with GPT-2 embeddings consistently outperformed those trained with other embeddings. That this holds even when data snooping is introduced into the embedding model indicates GPT-2's robustness in representing complex code features, even under less-than-ideal conditions.

What problem does this paper attempt to address?

This paper examines how data snooping affects deep-learning models that detect vulnerabilities in compiled code. Specifically, it asks how model performance changes when the embedding model is trained on the same samples that are later used to train and validate the neural network. By deliberately introducing data-snooping conditions, the study tests whether this design flaw has a significant effect on model performance.

### Research Background

- **Security issues of closed-source software**: Many organizations rely on closed-source software such as Microsoft Windows and Adobe Acrobat Reader. Vulnerabilities in such software can affect system security on a global scale.
- **Detecting vulnerabilities in compiled code**: Because closed-source software ships without source code, researchers apply machine-learning techniques to detect vulnerabilities in compiled code, especially stack-based buffer overflow (CWE-121) vulnerabilities.
- **The impact of data snooping**: Data snooping is a design flaw in which model training uses information that would not be available in a real deployment. It can inadvertently skew measured model performance.

### Research Objectives

This research aims to evaluate the impact of data snooping on deep-learning models for detecting vulnerabilities in compiled code, with a focus on data snooping in the embedding model layer. By introducing data-snooping conditions, it seeks to reveal how robust the embedding models are under non-ideal data conditions.

### Main Findings

- **Robustness of the GPT-2 model**: Even when data snooping is introduced, the neural network trained with GPT-2 embeddings still performs well, indicating that GPT-2 is robust in representing complex code features.
- **Data snooping has a minor impact on model performance**: Introducing data snooping does not significantly change model performance. This may mean that the impact of data snooping is limited, or that the randomly dropped samples contained hidden features that are crucial for optimal performance.

### Conclusions

Despite the introduction of data-snooping conditions, the performance of the different embedding models remains relatively stable, especially with GPT-2 embeddings. This suggests that these models have a degree of robustness to data pollution and noise and can maintain reliable performance under less-than-ideal conditions.

### Formula Examples

When discussing model performance, the paper reports changes in key metrics such as accuracy and F1-score. They are defined as follows:

- **Accuracy**:
  \[
  \text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FP}+\text{FN}+\text{TN}}
  \]
  where TP is the number of true positives, TN true negatives, FP false positives, and FN false negatives.
- **F1-Score**:
  \[
  \text{F1}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}
  \]
  where \(\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}\) and \(\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}\).

These formulas make it easier to compare the model's performance under different conditions.
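The accuracy and F1-score definitions above can be turned into a short Python sketch. This is only an illustration, not code from the paper: the function names are hypothetical, and the confusion-matrix counts in the usage lines are made-up numbers rather than reported results. Returning 0.0 when a denominator is zero is a convention chosen here for convenience.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for illustration only (not results from the paper).
tp, tn, fp, fn = 40, 45, 5, 10
print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
print(f"f1       = {f1_score(tp, fp, fn):.4f}")
```

Computing both metrics on the same confusion matrix for a baseline run and a data-snooped run would give the kind of side-by-side comparison the summary describes.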

Impact of Data Snooping on Deep Learning Models for Locating Vulnerabilities in Lifted Code
