zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning

Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen
2024-09-23
Abstract: Regarding software engineering (SE) tasks, large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning, unlike pre-trained models (PTMs). However, LLMs are primarily designed for natural language output and cannot directly produce intermediate embeddings from source code. They also face some challenges: for example, the restricted context length may prevent them from handling larger inputs, limiting their applicability to many SE tasks, while hallucinations may occur when LLMs are applied to complex downstream tasks. Motivated by the above facts, we propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs. Our approach utilizes LLMs to convert source code into concise summaries through zero-shot learning, which are then transformed into functional code embeddings using specialized embedding models. This unsupervised approach eliminates the need for training and addresses the issue of hallucinations encountered with LLMs. To the best of our knowledge, this is the first approach that combines LLMs and embedding models to generate code embeddings. We conducted experiments to evaluate the performance of our approach. The results demonstrate the effectiveness and superiority of our approach over state-of-the-art unsupervised methods.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the following problem: in software engineering tasks, large language models (LLMs) have zero-shot learning capabilities, but they face two main challenges when used to generate code embeddings: limited context length and hallucination. Specifically:

1. **Context length limitations**: The context length of LLMs is fixed, which restricts their ability to handle long code snippets. For example, the maximum context length of GPT-3.5 Turbo is 4,096 tokens, which is insufficient for tasks that involve a large number of code snippets, such as code classification, clustering, and search. When the input exceeds this length, the LLM may truncate the input or lose context, resulting in incomplete or inaccurate analysis.
2. **Hallucination problem**: LLMs are prone to hallucination when handling complex tasks; that is, they generate outputs containing inaccurate or contradictory facts. In tasks such as code clone detection, an LLM may misclassify code snippets even when it correctly summarizes what the code does, which undermines its reliability.

To solve these problems, the authors propose **zsLLMCode**, a method that combines the zero-shot learning capabilities of LLMs with sentence-embedding models to generate functional code embeddings without additional training or fine-tuning. The main objectives of this method are:

- **Overcoming context length limitations**: Each code snippet is converted into a code embedding individually, so the LLM's context length no longer limits how many snippets a task can cover.
- **Alleviating the hallucination problem**: The LLM is only asked to summarize the code, and a sentence-embedding model turns that summary into the code embedding, which reduces the opportunity for the LLM to hallucinate.
- **Providing an efficient and resource-effective solution**: The method does not require a large amount of training data or computing resources and is applicable to multiple programming languages and downstream tasks.

In conclusion, this paper aims to propose a novel and effective method for generating functional code embeddings by combining LLMs and sentence-embedding models, thereby improving the efficiency and accuracy of software engineering tasks.
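The two-stage idea described above (summarize each snippet with an LLM, then embed the summary with a sentence-embedding model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `PROMPT_TEMPLATE`, `code_embedding`, `toy_summarize`, and `toy_embed` are hypothetical names, the prompt wording is an assumption, and the two `toy_*` functions are deterministic stand-ins so the sketch runs without any model. In practice they would be replaced by a real LLM call and a real sentence-embedding model.

```python
import hashlib
import math
import re
from typing import Callable, List

Vector = List[float]

# Hypothetical zero-shot prompt; the paper's actual prompt wording is not given here.
PROMPT_TEMPLATE = (
    "Summarize the functionality of the following code in one sentence.\n\n{code}"
)


def code_embedding(
    code: str,
    summarize: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> Vector:
    """Embed one snippet: LLM summary first, then sentence embedding of the summary."""
    prompt = PROMPT_TEMPLATE.format(code=code)
    summary = summarize(prompt)  # stage 1: the LLM only summarizes
    return embed(summary)        # stage 2: the embedding model produces the vector


def toy_summarize(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): returns the lowercased words of the code."""
    code = prompt.split("\n\n", 1)[1]
    return " ".join(word.lower() for word in re.findall(r"[A-Za-z]+", code))


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Stand-in for a sentence-embedding model (assumption): hashed bag of words."""
    vec = [0.0] * dim
    for token in text.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    snippet_a = "def add(a, b):\n    return a + b"
    snippet_b = "def sum_two(x, y):\n    return x + y"
    emb_a = code_embedding(snippet_a, toy_summarize, toy_embed)
    emb_b = code_embedding(snippet_b, toy_summarize, toy_embed)
    print(f"similarity: {cosine(emb_a, emb_b):.3f}")
```

Because each snippet passes through the LLM on its own, a downstream task such as clone detection or clustering compares the resulting vectors (here with cosine similarity) rather than asking the LLM to judge many snippets in one prompt.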