Abstract: Developers often perform repetitive code editing activities for various reasons (e.g., code refactoring) during software development. Pre-trained code editing models have achieved state-of-the-art (SOTA) results. Such models are first pre-trained with pre-training tasks and then fine-tuned on the code editing task. Existing pre-training tasks are mainly code infilling tasks (e.g., masked language modeling), which are derived from the natural language processing field and are not designed for automatic code editing. This paper proposes a novel pre-training task specialized in code editing and presents an effective pre-trained code editing model named CodeEditor. Our pre-training task further improves the performance and generalization ability of code editing models. Specifically, we collect a large number of real-world code snippets as the ground truth and use a powerful generator to rewrite them into mutated versions. Then, we pre-train CodeEditor to edit the mutated versions into the corresponding ground truth, so that it learns edit patterns. We conduct experiments on four code editing datasets and evaluate the pre-trained CodeEditor in three settings. (1) In the fine-tuning setting, we train the pre-trained CodeEditor on the four datasets and evaluate it on the test data. CodeEditor outperforms the SOTA baselines by 15%, 25.5%, 9.4%, and 26.6% on the four datasets. (2) In the few-shot setting, we train the pre-trained CodeEditor with limited data and evaluate it on the test data. CodeEditor performs substantially better than all baselines. (3) In the zero-shot setting, CodeEditor correctly edits 1,113 programs, while the SOTA baselines cannot work.

What problem does this paper attempt to address?

This paper addresses the automation of code editing during software development. Developers often perform repetitive code-editing activities (such as code refactoring), and these activities take up a large share of their time (up to 70%). Many deep-learning (DL) models have been proposed to automate this process by learning from code-editing history. However, the existing pre-training tasks are mainly derived from natural language processing, such as masked language modeling, and are not specifically designed for automatic code editing. As a result, there is still room to improve the performance and generalization ability of these models in practice.

To close this gap, the paper proposes a new pre-training task designed for code editing and introduces a pre-trained code-editing model, CodeEditor. Unlike previous code infilling tasks, the new pre-training task uses a powerful generator to rewrite real code snippets into mutated versions, and then trains CodeEditor to edit these mutated versions back into the originals, so that the model learns edit patterns. The aim is to improve both the performance and the generalization ability of the code-editing model.

The main contributions of the paper are:
1. Proposing a new pre-training task that trains the model to edit automatically generated mutated programs back into their real versions, thereby improving the performance and generalization ability of the code-editing model.
2. Constructing a large-scale dataset and pre-training a code-editing model named CodeEditor on it.
3. Fine-tuning the pre-trained CodeEditor on four code-editing datasets. The experimental results show that CodeEditor outperforms the existing state-of-the-art (SOTA) baselines on all four datasets, by 15%, 25.5%, 9.4%, and 26.6%.
4. Evaluating the pre-trained CodeEditor in zero-shot and few-shot settings. The results show that CodeEditor generalizes well with limited fine-tuning data and, in the zero-shot setting, with no fine-tuning at all, where it correctly edits 1,113 programs.
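The pre-training idea described above (rewrite a real snippet into a mutated version, then train the model to edit it back) can be illustrated with a minimal Python sketch of how one training pair might be built. Everything here is an assumption made for illustration: the names `toy_generator` and `make_pretraining_pair`, the whitespace tokenization, the span length, and above all the generator itself, which is a random token substitution standing in for the learned generator the paper uses.

```python
import random
from typing import Callable, List, Tuple

Generator = Callable[[List[str], random.Random], List[str]]

# Hypothetical substitute vocabulary; a real generator would be a trained model.
TOY_VOCAB = ["tmp", "0", "None", "+", "-", "len", "x"]


def toy_generator(span: List[str], rng: random.Random) -> List[str]:
    """Stand-in generator: replace each token with a different one from TOY_VOCAB."""
    return [rng.choice([v for v in TOY_VOCAB if v != tok]) for tok in span]


def make_pretraining_pair(
    tokens: List[str],
    rng: random.Random,
    span_len: int = 2,
    generator: Generator = toy_generator,
) -> Tuple[List[str], List[str]]:
    """Return (mutated, ground_truth) for one real code snippet.

    A random span of the real snippet is rewritten by the generator; the
    model input is the mutated snippet and the target is the original one.
    """
    if len(tokens) <= span_len:
        raise ValueError("snippet is too short to mutate")
    start = rng.randrange(0, len(tokens) - span_len + 1)
    rewritten = generator(tokens[start:start + span_len], rng)
    mutated = tokens[:start] + rewritten + tokens[start + span_len:]
    return mutated, list(tokens)


# Usage: whitespace tokenization is an assumption made for this sketch.
snippet = "def add ( a , b ) : return a + b".split()
mutated, target = make_pretraining_pair(snippet, random.Random(0))
print("input :", " ".join(mutated))
print("target:", " ".join(target))
```

The editing model would then be trained to map `input` to `target`. Compared with plain infilling, where the span is replaced by a mask token, the model here sees plausible but wrong code in place of the span and has to decide what to change, which is closer to a real edit.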

CodeEditor: Learning to Edit Source Code with Pre-trained Models
