Abstract: Regarding software engineering (SE) tasks, large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning, unlike pre-trained models (PTMs). However, LLMs are primarily designed for natural language output and cannot directly produce intermediate embeddings from source code. They also face some challenges: the restricted context length may prevent them from handling larger inputs, limiting their applicability to many SE tasks, and hallucinations may occur when LLMs are applied to complex downstream tasks. Motivated by the above facts, we propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs. Our approach utilizes LLMs to convert source code into concise summaries through zero-shot learning, which are then transformed into functional code embeddings using specialized embedding models. This unsupervised approach eliminates the need for training and addresses the issue of hallucinations encountered with LLMs. To the best of our knowledge, this is the first approach that combines LLMs and embedding models to generate code embeddings. We conducted experiments to evaluate the performance of our approach. The results demonstrate the effectiveness and superiority of our approach over state-of-the-art unsupervised methods.

What problem does this paper attempt to address?

This paper addresses the following problem: in software engineering tasks, although large language models (LLMs) have zero-shot learning capabilities, they face two main challenges when used to generate code embeddings: context length limitations and hallucination. Specifically:

1. **Context length limitations**: The context length of LLMs is fixed, which restricts their ability to handle long code snippets. For example, the maximum context length of GPT-3.5 Turbo is 4,096 tokens, which is insufficient for tasks that deal with a large number of code snippets, such as code classification, clustering, and search. When the input exceeds this length, LLMs may truncate the input or lose context, resulting in incomplete or inaccurate analysis.
2. **Hallucination problem**: LLMs are prone to hallucination when handling complex tasks; that is, they generate outputs containing inaccurate or contradictory facts. In tasks such as code clone detection, even if an LLM can correctly summarize what each code snippet does, it may still misclassify the snippets, which undermines its reliability.

To solve these problems, the authors propose a new method, **zsLLMCode**, which uses the zero-shot learning capabilities of LLMs together with sentence-embedding models to generate functional code embeddings without additional training or fine-tuning. The main objectives of this method are:

- **Overcoming context length limitations**: Each code snippet is converted into a code embedding on its own, one at a time, so a task over many snippets no longer has to fit into a single LLM context window.
- **Alleviating the hallucination problem**: The LLM is asked only to summarize the code, and a sentence-embedding model turns that summary into a code embedding, which reduces the opportunity for the LLM to hallucinate.
- **Providing an efficient, low-resource solution**: The method does not require a large amount of training data or computing resources and is applicable to multiple programming languages and downstream tasks.

In conclusion, this paper proposes a novel and effective method for generating functional code embeddings by combining LLMs and sentence-embedding models, with the aim of improving the efficiency and accuracy of software engineering tasks.
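The two-stage idea described above (summarize each snippet, then embed the summary) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_summarize` and `toy_embed` are hypothetical stand-ins for a zero-shot LLM call and a sentence-embedding model, and the canned summaries and hashed bag-of-words vectors are assumptions chosen only so the example runs without any external service.

```python
import hashlib
import math
import re
from typing import Callable, List

Vector = List[float]


def toy_summarize(code: str) -> str:
    """Hypothetical stand-in for a zero-shot LLM summarization call.

    A real pipeline would send a fixed prompt plus `code` to an LLM;
    here a few keyword rules return canned one-sentence summaries.
    """
    if "sum(" in code or "+=" in code:
        return "Computes the sum of a list of numbers."
    if "sort" in code:
        return "Sorts a list of numbers in ascending order."
    return "Performs an unspecified operation."


def toy_embed(text: str, dim: int = 256) -> Vector:
    """Hypothetical stand-in for a sentence-embedding model.

    Builds an L2-normalized hashed bag-of-words vector.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def embed_code(
    code: str,
    summarize: Callable[[str], str] = toy_summarize,
    embed: Callable[[str], Vector] = toy_embed,
) -> Vector:
    """Embed one snippet: code -> summary -> vector.

    Each snippet is handled independently, so only a single snippet
    ever needs to fit into the summarizer's context.
    """
    return embed(summarize(code))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    loop_sum = "def total(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t"
    builtin_sum = "def add_all(values):\n    return sum(values)"
    sorter = "def order(xs):\n    return sorted(xs)"

    print(cosine(embed_code(loop_sum), embed_code(builtin_sum)))
    print(cosine(embed_code(loop_sum), embed_code(sorter)))
```

In this sketch the two summation snippets map to the same summary and therefore to the same vector, while the sorting snippet lands elsewhere. Downstream tasks such as clone detection or clustering would then compare vectors rather than asking the LLM to classify snippet pairs directly. Swapping in a real LLM and embedding model only requires passing different `summarize` and `embed` callables.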

zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning

LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer

LMs: Understanding Code Syntax and Semantics for Code Analysis

Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

On Extracting Specialized Code Abilities from Large Language Models: A Feasibility Study

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Llasm: Naming Functions in Binaries by Fusing Encoder-only and Decoder-only LLMs

How Far Have We Gone in Binary Code Understanding Using Large Language Models

Large Language Models as Code Executors: An Exploratory Study

EmbedLLM: Learning Compact Representations of Large Language Models

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

Improving Natural Language Capability of Code Large Language Model

VideoLLM: Modeling Video Sequence with Large Language Models

Code-mixed LLM: Improve Large Language Models' Capability to Handle Code-Mixing through Reinforcement Learning from AI Feedback