CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu
2024-04-06
Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4) outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.
Software Engineering, Artificial Intelligence, Computation and Language, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses a gap in existing code benchmarks: they evaluate code generation and largely ignore code editing. Specifically, existing evaluation methods overlook code-editing tasks that are central to the software development process, such as debugging, translation, optimization (called "polishing" in the abstract), and requirement switching. To fill this gap, the authors introduce **CodeEditorBench**, a framework designed specifically to evaluate the performance of large language models (LLMs) on code-editing tasks.

### Main contributions

1. **Comprehensive evaluation framework**:
   - **Diverse tasks**: Covers four common code-editing tasks: debugging, translation, optimization, and requirement switching.
   - **Multi-source datasets**: Collects diverse programming challenges from five different sources, covering multiple programming languages, complexity levels, and editing tasks.
   - **Strict evaluation criteria**: Evaluates model output through an online judge (OJ) system to ensure the accuracy and reliability of the results.
2. **Dataset construction**:
   - **Data screening**: Screens the initial data by code length and number of tokens so that the code reflects actual usage scenarios.
   - **Test case generation**: Uses large language models to generate test cases and verifies their correctness through the OJ system.
   - **Data filtering**: Applies a timestamp-based filter to exclude outdated information and improve the quality of the dataset.
3. **Model evaluation**:
   - **Extensive model evaluation**: Evaluates 19 popular LLMs, both open-source and closed-source, covering base models and instruction-tuned models of different scales.
   - **Multiple evaluation settings**: Employs multiple prompting methods, such as zero-shot, three-shot, and chain-of-thought prompting, to evaluate the models comprehensively.
4. **Performance analysis**:
   - **Performance comparison**: Shows that closed-source models (especially Gemini-Ultra and GPT-4) perform best on most tasks.
   - **Task-type sensitivity**: Analyzes how task type affects model performance, revealing each model's strengths and weaknesses on specific tasks.

### Conclusion

By introducing **CodeEditorBench**, the authors hope to promote the research and development of large language models in the code-editing field, provide a comprehensive and practical evaluation platform, and help researchers and practitioners better understand and improve the performance of these models.
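
### Illustrative sketch

The dataset-construction and judging steps summarized above (screening by code length and token count, timestamp-based filtering, and all-tests-must-pass checking in the style of an online judge) could look roughly like the following. This is a minimal sketch, not the authors' implementation: the `Problem` type, the `screen` and `passes_all` helpers, the thresholds, the cutoff date, and the whitespace-based token count are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, List, Tuple


@dataclass
class Problem:
    """Hypothetical record for one benchmark item."""
    title: str
    code: str
    published: date
    # (input, expected output) pairs, as an online judge would use them
    tests: List[Tuple[str, str]] = field(default_factory=list)


def screen(
    problems: List[Problem],
    max_lines: int = 800,            # assumed threshold, not from the paper
    max_tokens: int = 2048,          # assumed threshold, not from the paper
    cutoff: date = date(2023, 1, 1)  # assumed cutoff, not from the paper
) -> List[Problem]:
    """Keep problems that are short enough and published on or after the cutoff."""
    kept = []
    for p in problems:
        n_lines = len(p.code.splitlines())
        # Whitespace splitting stands in for a real model tokenizer.
        n_tokens = len(p.code.split())
        if n_lines > max_lines or n_tokens > max_tokens:
            continue  # data screening by length
        if p.published < cutoff:
            continue  # timestamp-based filtering
        kept.append(p)
    return kept


def passes_all(solution: Callable[[str], str], tests: List[Tuple[str, str]]) -> bool:
    """OJ-style verdict: accept only if every test case matches (and tests exist)."""
    return bool(tests) and all(
        solution(inp).strip() == expected.strip() for inp, expected in tests
    )
```

Here a candidate solution is modeled as a plain function from input text to output text; a real judge would instead compile and run the submitted program in a sandbox with time and memory limits.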