Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4) outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

What problem does this paper attempt to address?

This paper addresses a gap in existing code benchmarks: they mainly evaluate code generation and overlook the code-editing tasks that are crucial in software development, such as debugging, translation, polishing (optimization), and requirement switching. To fill this gap, the authors introduce **CodeEditorBench**, a framework designed specifically to evaluate large language models (LLMs) on code-editing tasks.

### Main contributions

1. **Comprehensive evaluation framework**
   - **Diverse tasks**: Covers four common code-editing tasks: debugging, translation, optimization, and requirement switching.
   - **Multi-source datasets**: Collects diverse programming challenges from five different sources, spanning multiple programming languages, complexity levels, and editing tasks.
   - **Strict evaluation criteria**: Runs performance evaluations through an online judge (OJ) system to ensure the accuracy and reliability of the results.
2. **Dataset construction**
   - **Data screening**: Screens the initial data by code length and token count so that the code reflects realistic usage scenarios.
   - **Test case generation**: Uses large language models to generate test cases and verifies their correctness through the OJ system.
   - **Data filtering**: Applies a timestamp-based filter to exclude outdated data and improve dataset quality.
3. **Model evaluation**
   - **Extensive model evaluation**: Evaluates 19 popular LLMs, both open-source and closed-source, covering base models and instruction-tuned models of different scales.
   - **Multiple evaluation settings**: Employs several prompting methods, including zero-shot, three-shot, and chain-of-thought prompting, to evaluate the models comprehensively.
4. **Performance analysis**
   - **Performance comparison**: Shows that closed-source models (especially Gemini-Ultra and GPT-4) perform best on most tasks.
   - **Task-type sensitivity**: Analyzes how task type affects model performance, revealing each model's strengths and weaknesses on specific tasks.

### Conclusion

By introducing **CodeEditorBench**, the authors hope to promote research and development of large language models for code editing, provide a comprehensive and practical evaluation platform, and help researchers and practitioners better understand and improve these models.
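The dataset-construction steps summarized above (screening by code length and token count, then filtering by timestamp) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Problem` record, the threshold values, the cutoff date, and the whitespace-based token count are all assumptions made for illustration, since the summary does not state the actual values or tokenizer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    """Hypothetical record for one candidate problem; field names are illustrative."""
    source_code: str
    created: date


# Illustrative thresholds only; the summary does not give the real values.
MAX_LINES = 200
MAX_TOKENS = 800
CUTOFF = date(2023, 1, 1)  # assumed: problems older than this count as outdated


def count_tokens(code: str) -> int:
    # A whitespace split stands in for a real model tokenizer (an assumption).
    return len(code.split())


def keep(problem: Problem) -> bool:
    """Return True if the problem passes the length, token, and timestamp screens."""
    n_lines = len(problem.source_code.splitlines())
    return (
        n_lines <= MAX_LINES
        and count_tokens(problem.source_code) <= MAX_TOKENS
        and problem.created >= CUTOFF
    )


def screen(problems: list[Problem]) -> list[Problem]:
    """Apply all screens and preserve the input order of surviving problems."""
    return [p for p in problems if keep(p)]
```

In a real pipeline the token count would likely come from the tokenizer of the models under evaluation, and the cutoff would be chosen relative to their training-data dates; both choices here are placeholders.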

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
