Abstract: Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, reusing various code pre-trained models (PTMs) for code embedding has become common in vulnerability detection, often without a reasonable justification for the choice of model. The premise behind this casual use of PTMs is that the code embeddings generated by different PTMs have a similar impact on performance. Is that TRUE? To answer this important question, we systematically investigate the effects of code embeddings generated by ten different code PTMs on the performance of vulnerability detection, and find that it is NOT true. We observe that code embeddings generated by various code PTMs do influence performance, and that selecting an embedding technique based on parameter scale and embedding dimension is not reliable. Our findings highlight the necessity of quantifying and evaluating the characteristics of code embeddings generated by various code PTMs to understand these effects. To achieve this goal, we analyze the numerical representation and data distribution of code embeddings generated by different PTMs to evaluate their differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) for constructing a specialized code PTM recommendation dataset. We then employ a Random Forest classifier to train a recommendation model and identify the optimal code PTMs from the candidate model zoo.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **How to select the optimal pre-trained models (PTMs) for code embedding in vulnerability detection tasks?**

### Problem Background

In software engineering, code vulnerability detection is an important research direction. Existing methods are based on static analysis, machine learning, or deep learning. However, these methods either require additional time to learn specialized tools (static analysis) or require training complete models from scratch (machine learning and deep learning), which consumes substantial resources. In recent years, with the growing popularity of pre-trained models (PTMs), researchers have begun to use these models for code representation and then apply the representations to vulnerability detection tasks, saving training cost and improving performance.

### Research Motivation

Although many studies have used different code pre-trained models (such as CodeBERT, CodeT5, and CodeGen) for code embedding, they have not given convincing reasons for choosing a specific code PTM to generate context embeddings, nor have they compared how different code PTMs perform when generating code embeddings. The implicit assumption is that the code embeddings generated by different code PTMs have a similar impact on the performance of vulnerability detection tasks. This assumption has not been thoroughly verified.

### Research Questions

To test this assumption, the authors conducted a systematic study built around the following two research questions:

1. **RQ1: Will the code embeddings generated by different code PTMs affect the performance of vulnerability detection tasks?**
2. **RQ2: What are the characteristics of the code embeddings generated by different PTMs?**

### Main Findings

Through experiments, the authors reached the following conclusions:

- **The code embeddings generated by different code PTMs do have a significant impact on the performance of vulnerability detection tasks**. A larger parameter scale does not necessarily mean higher-dimensional code embeddings or better task performance.
- **The code embeddings generated by different code PTMs differ significantly in numerical distribution, numerical range, and data distribution**. For example, the code embeddings generated by the CodeT5 family tend to show an almost perfect normal distribution, while those of the PolyCoder family tend to show a skewed distribution.

### Solutions

Based on the above findings, the authors proposed a recommendation framework named **Coding-PTMs**, which aims to help engineers select the code PTM best suited to their specific vulnerability detection tasks. Specifically, the authors defined thirteen code embedding metrics covering three dimensions (statistics, norm, and distribution) and used them to construct a new code embedding dataset. A Random Forest classifier was then trained as the recommendation model to determine whether a candidate code PTM can generate high-quality code embeddings and achieve better task performance.

### Contributions

The main contributions of this paper include:

1. **A systematic study of the impact of code embeddings generated by different code PTMs on vulnerability detection tasks**, including an analysis of the differences and characteristics of these embeddings.
2. **A set of thirteen metrics** for quantifying the differences and characteristics of multiple code embeddings.
3. **A recommendation framework based on code embedding metrics** to guide software engineering researchers and practitioners in selecting appropriate code PTMs to generate high-quality code embeddings, thereby improving task performance.

Through this work, the authors provide a principled basis for selecting code pre-trained models suited to specific vulnerability detection tasks and lay a foundation for further research in related fields.
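The three metric dimensions named above (statistics, norm, and distribution) can be illustrated with a minimal sketch. The summary does not list the thirteen metrics themselves, so the specific quantities below, the function name `embedding_metrics`, and the dictionary keys are assumptions chosen for illustration and not the authors' definitions. The sketch uses only the Python standard library and treats one embedding as a plain list of floats.

```python
import math
import statistics


def embedding_metrics(vec):
    """Compute illustrative summary metrics for one embedding vector.

    Hypothetical helper: the metric set is an assumption, not the
    thirteen metrics defined in the paper.
    """
    n = len(vec)
    mean = statistics.fmean(vec)
    std = statistics.pstdev(vec)  # population standard deviation

    # Distribution shape from standardized central moments.
    # A constant vector has zero spread, so shape is reported as 0.
    if std > 0:
        skew = sum(((x - mean) / std) ** 3 for x in vec) / n
        kurt = sum(((x - mean) / std) ** 4 for x in vec) / n - 3.0
    else:
        skew, kurt = 0.0, 0.0

    return {
        # "statistics" dimension
        "mean": mean,
        "std": std,
        "min": min(vec),
        "max": max(vec),
        # "norm" dimension
        "l1_norm": sum(abs(x) for x in vec),
        "l2_norm": math.sqrt(sum(x * x for x in vec)),
        "linf_norm": max(abs(x) for x in vec),
        # "distribution" dimension
        "skewness": skew,
        "excess_kurtosis": kurt,
    }


if __name__ == "__main__":
    # Toy vectors standing in for real PTM embeddings.
    symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
    skewed = [0.0, 0.0, 0.0, 0.0, 10.0]
    for name, vec in (("symmetric", symmetric), ("skewed", skewed)):
        print(name, embedding_metrics(vec))
```

In this sketch, a skewness near zero is consistent with the roughly symmetric, near-normal shape reported for the CodeT5 family, while a clearly nonzero skewness corresponds to the skewed shape reported for the PolyCoder family. Per-embedding feature vectors of this kind are what a classifier such as a Random Forest could then be trained on, although the exact features and labels used by Coding-PTMs are not given in this summary.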

Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?
