Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar

Yuanliang Zhang, Yifan Xie, Shanshan Li, Ke Liu, Chong Wang, Zhouyang Jia, Xiangbing Huang, Jie Song, Chaopeng Luo, Zhizheng Zheng, Rulin Xu, Yitong Liu, Si Zheng, Xiangke Liao
2024-12-11
Abstract: Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of LLMs has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate these capabilities. However, the current evaluation process may encounter the illusion of "Specialist in Familiarity", primarily due to three gaps: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been extensively exposed and exercised during the training phase, and, because LLMs are continuously trained and developed, the timeliness of these datasets has been severely compromised. The key to solving the problem is to evaluate LLMs, as much as possible, using code that they have not encountered before. Thus, the fundamental idea in this paper is to draw on the concept of code obfuscation, changing code at different levels while preserving its functionality and output. To this end, we build a code-obfuscation-based benchmark, OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, including function descriptions and code. Then we use a three-level strategy (symbol, structure, and semantic) to obfuscate descriptions, code, and context dependencies. We evaluate four LLMs on OBFUSEVAL and compare the effectiveness of the different obfuscation strategies. We use the official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease ratio of the test pass rate can reach up to 62.5%.
Software Engineering, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper addresses deficiencies in how the code-generation ability of large language models (LLMs) is currently evaluated. Specifically, the authors identify three main problems in existing evaluation methods and propose a new benchmark based on code obfuscation to evaluate LLM code generation more objectively.

#### 1. **Exposure of target code**

- **Current situation**: Existing evaluation benchmarks mainly rewrite the function descriptions without modifying the code itself, so the target code may already have been exposed to LLMs during pre-training, making the evaluation results less objective.
- **Solution**: Select highly starred projects from GitHub and pick the functions introduced after a specific point in time, so that this code is unlikely to have appeared in the training data of LLMs.

#### 2. **Timeliness of cases**

- **Current situation**: LLMs develop rapidly and are continuously retrained on updated data. Even if a benchmark collects historically modified code, that code may still appear in the training sets of subsequent model versions, so the benchmark can no longer measure the ability to generate new code.
- **Solution**: Apply a multi-level obfuscation strategy (symbol, structure, and semantic) to all original data so that the dataset can be reused and will not become future training data.

#### 3. **Provision of dependencies**

- **Current situation**: Existing benchmarks directly provide all the dependencies required to generate the target code. In actual development this condition is difficult to meet, so the evaluation results do not match real-world usage.
- **Solution**: Provide the relevant code dependencies in a compromise form that simulates real-world development without deliberately handicapping the LLM: identify all the context dependencies required by each function, then add some irrelevant dependencies to increase obfuscation.

### Main contributions of the paper

1. **Reveal the deficiencies of existing benchmarks**: Point out the three major problems in existing benchmarks for evaluating the code-generation ability of LLMs: exposure of target code, timeliness of cases, and provision of dependencies.
2. **Propose an obfuscation-based method**: Design obfuscation strategies at different levels (symbol, structure, and semantic) and verify their effectiveness, so that the evaluation does not depend on target code the model has already seen during training.
3. **Construct a new benchmark, OBFUSEVAL**: Use code from real-world projects to construct an obfuscated benchmark and evaluate four state-of-the-art code-generation models on it. The results show that after code obfuscation, the average decrease ratio of the test pass rate can reach 62.5%, indicating that the capabilities of existing LLMs may be overestimated.
4. **Discover non-functional code problems**: Even when all tests pass, the code generated by LLMs may still have non-functional problems (such as weak robustness). This finding helps developers better understand and use LLM-generated code.

### Summary

This paper addresses the problems in existing evaluation methods by proposing a new benchmark based on code obfuscation, making the evaluation of LLM code generation more objective and closer to actual development scenarios.
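
To make the symbol-level strategy concrete, the following is a minimal sketch of what renaming identifiers while preserving behavior can look like. It is not the authors' implementation, which this summary does not describe. The function name `obfuscate_symbols`, the `v0, v1, ...` naming scheme, the sample function, and the choice of Python are all assumptions made for illustration. The sketch renames only parameters and locally assigned variables and leaves function names, attributes, globals, and builtins untouched.

```python
import ast


def obfuscate_symbols(source: str) -> str:
    """Hypothetical symbol-level obfuscation: rename parameters and
    locally assigned variables to opaque names, keeping behavior intact.

    Assumptions of this sketch: the source does not already use names
    like v0, v1, ..., does not use `global`/`nonlocal`, and callers do
    not pass arguments by keyword. Requires Python 3.9+ (ast.unparse).
    """
    tree = ast.parse(source)

    # Pass 1: collect every parameter name and every assigned name.
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"

    # Pass 2: rewrite all occurrences (definitions and uses) consistently.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg) and node.arg in mapping:
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree)


# Illustrative input; any real benchmark case would come from a project.
ORIGINAL = '''
def total_price(prices, tax):
    subtotal = 0
    for p in prices:
        subtotal += p
    return subtotal * (1 + tax)
'''

if __name__ == "__main__":
    print(obfuscate_symbols(ORIGINAL))
```

Collecting names in a first pass and rewriting in a second keeps the renaming consistent even when a variable is read before the line that assigns it. Structure-level and semantic-level obfuscation would need considerably more than this, for example rewriting control flow or replacing an algorithm with an equivalent one, and are outside what this sketch attempts.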