CodeEditor: Learning to Edit Source Code with Pre-trained Models

Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, Zhiyi Fu
DOI: https://doi.org/10.1145/3597207
2023-09-07
Abstract: Developers often perform repetitive code editing activities for various reasons (e.g., code refactoring) during software development. Pre-trained code editing models have achieved state-of-the-art (SOTA) results. Pre-trained models are first pre-trained with pre-training tasks and then fine-tuned with the code editing task. Existing pre-training tasks are mainly code infilling tasks (e.g., masked language modeling), which are derived from the natural language processing field and are not designed for automatic code editing. This paper proposes a novel pre-training task specialized in code editing and presents an effective pre-trained code editing model named CodeEditor. Our pre-training task further improves the performance and generalization ability of code editing models. Specifically, we collect a large number of real-world code snippets as the ground truth and use a powerful generator to rewrite them into mutated versions. Then, we pre-train our CodeEditor to edit mutated versions into the corresponding ground truth, to learn edit patterns. We conduct experiments on four code editing datasets and evaluate the pre-trained CodeEditor in three settings. (1) In the fine-tuning setting, we train the pre-trained CodeEditor with four datasets and evaluate it on the test data. CodeEditor outperforms the SOTA baselines by 15%, 25.5%, 9.4%, and 26.6% on the four datasets. (2) In the few-shot setting, we train the pre-trained CodeEditor with limited data and evaluate it on the test data. CodeEditor performs substantially better than all baselines. (3) In the zero-shot setting, CodeEditor correctly edits 1,113 programs while the SOTA baselines cannot work.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the automation of code editing during software development. Developers often perform repetitive code editing activities (such as code refactoring), and these activities take up a large share of their time (up to 70%). Many deep learning (DL) based models have been proposed to automate this process by learning from code editing history. However, existing pre-training tasks are mainly derived from the field of natural language processing, such as masked language modeling, and are not designed for automatic code editing. As a result, there is still room for improvement in the performance and generalization ability of these models in practical applications.

To address this, the paper proposes a new pre-training task specialized in code editing and presents an effective pre-trained code editing model, CodeEditor. Unlike previous code infilling tasks, the new pre-training task uses a powerful generator to rewrite real code snippets into mutated versions, and then trains CodeEditor to edit these mutated versions back into the original versions, thereby learning edit patterns. This method aims to improve the performance and generalization ability of code editing models.

The main contributions of the paper are:
1. Proposing a new pre-training task that trains the model to edit automatically generated mutated programs back into their real versions, thereby improving the performance and generalization ability of code editing models.
2. Constructing a large-scale dataset and pre-training an effective code editing model named CodeEditor on it.
3. Fine-tuning the pre-trained CodeEditor on four code editing datasets. The experimental results show that CodeEditor outperforms the existing state-of-the-art (SOTA) baselines on all four datasets, by up to 26.6%.
4. Evaluating the pre-trained CodeEditor in zero-shot and few-shot settings. The results show that even without additional training, CodeEditor exhibits strong generalization ability.
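
The pre-training task described above (mutate a real snippet, then learn to edit it back) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, for illustration only, that mutation works by replacing a few token positions with generator output, and it uses a hypothetical `toy_generator` that samples from a small vocabulary in place of the powerful generator the paper refers to. The names `mutate`, `make_pretraining_pair`, and `mask_ratio` are likewise illustrative.

```python
import random
from typing import List, Tuple


def toy_generator(context: List[str], vocab: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for the paper's generator.

    A real generator would be a language model conditioned on `context`;
    this toy version ignores the context and samples a token from `vocab`.
    """
    return rng.choice(vocab)


def mutate(tokens: List[str], vocab: List[str], rng: random.Random,
           mask_ratio: float = 0.15) -> List[str]:
    """Return a copy of `tokens` with some positions rewritten by the generator.

    Assumption: `mask_ratio` controls how many positions are rewritten;
    at least one position is always chosen.
    """
    n_rewrite = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_rewrite)
    mutated = list(tokens)
    for pos in positions:
        mutated[pos] = toy_generator(mutated[:pos], vocab, rng)
    return mutated


def make_pretraining_pair(code: str, vocab: List[str],
                          seed: int = 0) -> Tuple[List[str], List[str]]:
    """Build one (input, target) pair: the model input is the mutated
    version and the training target is the original (ground-truth) code."""
    rng = random.Random(seed)
    ground_truth = code.split()  # whitespace tokenization, for simplicity
    mutated = mutate(ground_truth, vocab, rng)
    return mutated, ground_truth


if __name__ == "__main__":
    snippet = "int add ( int a , int b ) { return a + b ; }"
    vocab = ["a", "b", "+", "-", "*", "int", "float", "return"]
    source, target = make_pretraining_pair(snippet, vocab, seed=1)
    print("mutated :", " ".join(source))
    print("original:", " ".join(target))
```

A sequence-to-sequence model pre-trained on such pairs sees an imperfect program as input and the real program as output, which is closer to the downstream editing task than filling in masked spans. Because the toy generator samples at random, it may occasionally reproduce the original token; a learned generator would instead produce plausible but different code.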