Abstract: Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, gaps remain before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of LLMs has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate these capabilities. However, the current evaluation process may suffer from the illusion of "Specialist in Familiarity", primarily due to three gaps: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been extensively exposed and exercised during the training phase, and because LLMs are continuously trained and developed, the timeliness of these datasets has been severely compromised. The key to solving this problem is to evaluate LLMs, as far as possible, on code that they have not encountered before. Thus, the fundamental idea in this paper is to draw on the concept of code obfuscation, changing code at different levels while preserving its functionality and output. To this end, we build a code-obfuscation-based benchmark, OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, each including a function description and code. Then we use a three-level strategy (symbol, structure, and semantic) to obfuscate descriptions, code, and context dependencies. We evaluate four LLMs on OBFUSEVAL and compare the effectiveness of the different obfuscation strategies. We use the official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease ratio of the test pass rate can reach up to 62.5%.
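To make the symbol level of the three-level strategy concrete, the following is a minimal sketch of identifier renaming that preserves behavior. It is not the paper's implementation: the function `obfuscate_symbols`, the `func_N`/`var_N` naming scheme, and the sample function `count_words` are all hypothetical, and the sketch ignores attributes, keyword arguments, and name collisions. It assumes Python 3.9 or later for `ast.unparse`.

```python
import ast


def obfuscate_symbols(source: str):
    """Rename functions, parameters, and assigned variables to opaque names.

    Hypothetical illustration of symbol-level obfuscation; returns the
    rewritten source and the old-name -> new-name mapping.
    """
    tree = ast.parse(source)
    mapping = {}

    def alias(name: str, prefix: str) -> str:
        # Reuse the existing alias, or allocate the next opaque name.
        return mapping.setdefault(name, f"{prefix}_{len(mapping)}")

    # Pass 1: collect every name defined by the code itself.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            alias(node.name, "func")
            for arg in node.args.args:
                alias(arg.arg, "var")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            alias(node.id, "var")

    # Pass 2: rewrite definitions and uses; builtins such as len stay untouched
    # because they were never added to the mapping.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree), mapping


# Hypothetical sample function, not taken from the benchmark.
original = """
def count_words(text):
    words = text.split()
    total = len(words)
    return total
"""

obfuscated, mapping = obfuscate_symbols(original)
print(obfuscated)
```

Because only names change, the original tests still apply once the function is looked up under its new name via `mapping`. Structure-level and semantic-level obfuscation would need deeper rewrites than this sketch attempts.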

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to address deficiencies in current methods for evaluating large language models (LLMs) on code generation tasks. Specifically, the authors identify three main problems in existing evaluation methods and propose a new benchmarking method based on code obfuscation to evaluate the code generation ability of LLMs more objectively.

#### 1. **Exposure of target code**

- **Current situation**: Existing evaluation benchmarks mainly rewrite the function descriptions without modifying the code itself. The target code may therefore have been exposed to LLMs during the pre-training phase, making the evaluation results less objective.
- **Solution**: Select highly starred projects from GitHub and pick functions introduced after a specific point in time, so that this code is unlikely to have appeared in the training data of LLMs.

#### 2. **Timeliness of cases**

- **Current situation**: As LLMs develop rapidly, their training data is continuously updated. Even if a benchmark collects code from historical modifications, that code may still end up in the training sets of later model versions, so the benchmark can no longer evaluate the ability to generate new code.
- **Solution**: Apply multi-level obfuscation (symbol, structure, and semantic) to all original data so that the dataset can be reused and does not become future training data.

#### 3. **Provision of dependencies**

- **Current situation**: Existing benchmarks directly provide all the dependencies required to generate the target code. In actual development scenarios this condition is difficult to meet, so the evaluation results do not match real-world usage.
- **Solution**: Provide the relevant code dependencies in a compromise form that simulates the real-world development scenario without deliberately sacrificing the generation ability of LLMs. Identify all the context dependencies required by each function and add some irrelevant dependencies to increase obfuscation.

### Main contributions of the paper

1. **Reveal the deficiencies of existing benchmarks**: Point out the three major problems in existing benchmarks when evaluating the code generation ability of LLMs: exposure of target code, timeliness of cases, and provision of dependencies.
2. **Propose an obfuscation-based method**: Design obfuscation strategies at different levels (symbol, structure, and semantic) and verify their effectiveness in preventing the target code from being exposed during the training phase.
3. **Construct a new benchmark, OBFUSEVAL**: Use code from real-world projects to construct an obfuscated benchmark and evaluate four state-of-the-art code generation models on it. The results show that after code obfuscation, the average decrease ratio of the test pass rate can reach 62.5%, indicating that the capabilities of existing LLMs may be overestimated.
4. **Discover non-functional code problems**: Even if all tests pass, the code generated by LLMs may still have non-functional problems (such as poor robustness). This finding gives developers guidance for better understanding and using LLM-generated code.

### Summary

This paper addresses the problems in existing evaluation methods by proposing a new benchmarking method based on code obfuscation, making the evaluation of the code generation ability of LLMs more objective and closer to actual development scenarios.
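The summary reports an average decrease ratio of the test pass rate but does not define it. A minimal sketch follows, assuming the ratio is the relative drop `(before - after) / before` for each model, averaged over models. That definition, the function names, and the sample pass rates are all assumptions for illustration; none of the numbers are results from the paper.

```python
def decrease_ratio(pass_before: float, pass_after: float) -> float:
    """Relative drop in test pass rate after obfuscation (assumed definition)."""
    if pass_before <= 0:
        raise ValueError("pass_before must be positive to compute a relative drop")
    return (pass_before - pass_after) / pass_before


def average_decrease_ratio(results: dict) -> float:
    """Average the per-model decrease ratios.

    `results` maps a model name to (pass rate before, pass rate after).
    """
    ratios = [decrease_ratio(before, after) for before, after in results.values()]
    return sum(ratios) / len(ratios)


# Hypothetical pass rates, invented for this example only.
hypothetical = {
    "model_a": (0.50, 0.25),
    "model_b": (0.40, 0.30),
}

print(f"{average_decrease_ratio(hypothetical):.1%}")
```

A relative ratio is used here rather than an absolute difference so that models with different baseline pass rates are compared on the same scale; the paper may aggregate differently.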

Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar
