Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?

Yu Zhao, Lina Gong, Zhiqiu Huang, Yongwei Wang, Mingqiang Wei, Fei Wu
2024-08-09
Abstract: Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, reusing various code pre-trained models for code embedding has become common in vulnerability detection, usually without a reasonable justification for the choice of model. The premise behind this casual use of pre-trained models (PTMs) is that the code embeddings generated by different PTMs have a similar impact on performance. Is that TRUE? To answer this important question, we systematically investigate the effects of code embeddings generated by ten different code PTMs on the performance of vulnerability detection, and find that it is NOT true. We observe that code embeddings generated by various code PTMs do influence performance, and that selecting an embedding technique based on parameter scale and embedding dimension is not reliable. Our findings highlight the necessity of quantifying and evaluating the characteristics of code embeddings generated by various code PTMs in order to understand these effects. To achieve this goal, we analyze the numerical representation and data distribution of code embeddings generated by different PTMs to evaluate their differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) for constructing a specialized code PTM recommendation dataset. We then employ a Random Forest classifier to train a recommendation model and identify the optimal code PTMs from the candidate model zoo.
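As a rough illustration of what measuring embeddings along the three dimensions named in the abstract (statistics, norm, and distribution) could look like, the sketch below summarizes an embedding matrix with a handful of metrics. The abstract does not enumerate the thirteen metrics, so the specific metrics, the function name `embedding_metrics`, and the synthetic embeddings used here are assumptions chosen for illustration, not the paper's definitions.

```python
import numpy as np


def embedding_metrics(emb: np.ndarray) -> dict:
    """Summarize a (n_samples, dim) embedding matrix.

    The metrics are illustrative stand-ins for the three dimensions
    (statistics, norm, distribution); they are not the paper's thirteen.
    """
    flat = emb.ravel().astype(np.float64)
    mean = flat.mean()
    std = flat.std()
    centered = flat - mean

    # Distribution shape via standardized moments; guard constant inputs.
    if std > 0:
        skewness = float(np.mean(centered ** 3) / std ** 3)
        excess_kurtosis = float(np.mean(centered ** 4) / std ** 4 - 3.0)
    else:
        skewness = 0.0
        excess_kurtosis = 0.0

    # Per-sample vector norms, averaged over all samples.
    l1 = np.linalg.norm(emb, ord=1, axis=1)
    l2 = np.linalg.norm(emb, ord=2, axis=1)

    return {
        # statistics
        "mean": float(mean),
        "std": float(std),
        "min": float(flat.min()),
        "max": float(flat.max()),
        # norm
        "mean_l1_norm": float(l1.mean()),
        "mean_l2_norm": float(l2.mean()),
        # distribution
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Synthetic stand-ins for embeddings from two hypothetical PTMs:
# one roughly normal, one right-skewed.
rng = np.random.default_rng(0)
normal_like = rng.normal(size=(200, 64))
skewed = rng.exponential(size=(200, 64))

m_normal = embedding_metrics(normal_like)
m_skewed = embedding_metrics(skewed)
```

Comparing `m_normal["skewness"]` with `m_skewed["skewness"]` shows how a single distribution metric can separate two embedding sets that share the same shape and dimension.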
Software Engineering
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **How to select the optimal pre-trained models (PTMs) for code embedding in vulnerability detection tasks?**

### Problem Background

In software engineering, code vulnerability detection is an important research direction. Existing methods are based on static analysis, machine learning, or deep learning. However, these methods either require additional time to learn specialized tools (as with static analysis) or require training complete models from scratch (as with machine learning and deep learning), which consumes a large amount of resources. In recent years, with the growing popularity of pre-trained models (PTMs), researchers have begun to use these models for code representation and then apply the representations to vulnerability detection tasks, saving training cost and improving performance.

### Research Motivation

Although many studies have used different code pre-trained models (such as CodeBERT, CodeT5, and CodeGen) for code embedding, they have not given convincing reasons for choosing a specific code PTM to generate contextual embeddings, nor have they compared how well different code PTMs generate code embeddings. The implicit assumption is therefore that the code embeddings generated by different code PTMs have a similar impact on the performance of vulnerability detection tasks. This assumption has not been thoroughly verified.

### Research Questions

To test this assumption, the authors conducted a systematic study and explored the following two research questions:

1. **RQ1: Will the code embeddings generated by different code PTMs affect the performance of vulnerability detection tasks?**
2. **RQ2: What are the characteristics of the code embeddings generated by different PTMs?**

### Main Findings

Through experiments, the authors reached the following conclusions:

- **The code embeddings generated by different code PTMs do have a significant impact on the performance of vulnerability detection tasks**. A larger parameter scale does not necessarily mean higher-dimensional code embeddings or better task performance.
- **The code embeddings generated by different code PTMs differ significantly in numerical distribution, numerical range, and data distribution**. For example, the code embeddings generated by the CodeT5 family tend to show an almost perfect normal distribution, while those of the PolyCoder family tend to show a skewed distribution.

### Solutions

Based on the above findings, the authors proposed a recommendation framework named **Coding-PTMs**, which aims to help engineers select the code PTM most suitable for their specific vulnerability detection tasks. Specifically, the authors defined thirteen code embedding metrics covering three dimensions (statistics, norm, and distribution) and used them to construct a new code embedding dataset. A random forest classifier was then used to train the recommendation model, which determines whether a candidate code PTM can generate high-quality code embeddings and obtain better task performance.

### Contributions

The main contributions of this paper include:

1. **A systematic study of the impact of code embeddings generated by different code PTMs on vulnerability detection tasks**, with an analysis of the differences and characteristics of these embeddings.
2. **A set of thirteen metrics** for quantifying the differences and characteristics of multiple code embeddings.
3. **A recommendation framework based on code embedding metrics** to guide software engineering researchers and practitioners in selecting appropriate code PTMs that generate high-quality code embeddings, thereby improving task performance.

Through this work, the authors provide a scientific basis for selecting code pre-trained models suited to specific vulnerability detection tasks and lay a foundation for further research in related fields.
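
The recommendation step described above can be sketched as a binary random forest classifier trained on metric vectors. This is a minimal sketch under stated assumptions, not the authors' implementation: the training data is synthetic, the labeling rule is invented for the demonstration, the candidate names (`ptm_a`, `ptm_b`, `ptm_c`) and the `recommend` helper are hypothetical, and ranking candidates by predicted probability is one possible way to pick from a model zoo rather than something the paper specifies. It uses scikit-learn's `RandomForestClassifier`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical recommendation dataset: each row is a 13-value metric
# vector describing one PTM's embeddings on one task; the label says
# whether those embeddings led to good detection performance.
n_rows, n_metrics = 300, 13
X = rng.normal(size=(n_rows, n_metrics))
# Synthetic labeling rule, invented purely for this demonstration.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Train on the first 240 rows and hold out the remaining 60.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:240], y[:240])
held_out_accuracy = clf.score(X[240:], y[240:])


def recommend(model, candidate_metrics):
    """Rank candidate PTMs by predicted probability of the positive class.

    candidate_metrics maps a candidate name to its metric vector.
    """
    names = list(candidate_metrics)
    features = np.vstack([candidate_metrics[name] for name in names])
    positive_proba = model.predict_proba(features)[:, 1]
    order = np.argsort(-positive_proba)
    return [(names[i], float(positive_proba[i])) for i in order]


# Hypothetical model zoo with hand-made metric vectors.
strong = np.zeros(n_metrics)
strong[:2] = 2.0
neutral = np.zeros(n_metrics)
weak = np.zeros(n_metrics)
weak[:2] = -2.0

ranking = recommend(clf, {"ptm_a": strong, "ptm_b": neutral, "ptm_c": weak})
best_candidate = ranking[0][0]
```

In a real setting, each row of `X` would be computed from actual code embeddings and each label from measured vulnerability detection performance. The synthetic data here only demonstrates the mechanics of training and ranking.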