Investigating large language models capabilities for automatic code repair in Python

Safwan Omari, Kshitiz Basnet, Mohammad Wardat
DOI: https://doi.org/10.1007/s10586-024-04490-8
2024-05-11
Cluster Computing
Abstract: Developers often make mistakes in introductory programming tasks as part of the development process, and fixing these mistakes manually can be time-consuming and demanding. Automated program repair (APR) techniques offer a potential solution by synthesizing fixes for such errors. Previous research has investigated both symbolic and neural techniques in the APR domain, but these approaches typically demand significant engineering effort or extensive datasets and training. In this paper, we explore the potential of a large language model trained on code. Specifically, we assess ChatGPT's ability to detect and repair bugs in simple Python programs. The experimental evaluation covers two benchmarks, QuixBugs and Textbook. Each benchmark consists of simple Python functions that implement well-known algorithms, and each function contains a single bug. To gauge repair performance in different settings, we introduced several benchmark variations, including the addition of plain-English documentation and code obfuscation. In our experiments, ChatGPT correctly detected and fixed about 50% of the methods when the code was documented. Repair performance dropped to 25% when the code was obfuscated, and to 15% when documentation was removed and the code was obfuscated. Furthermore, ChatGPT considerably outperformed the existing APR systems it was compared against.
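To make the benchmark variations concrete, the following is a minimal, hypothetical sketch of what a single-bug function, its obfuscated and undocumented variant, and a candidate repair might look like. None of this code is taken from QuixBugs or Textbook. The function names, the bug, the obfuscation style, and the test-based check (`passes_tests`) are all assumptions for illustration, since the abstract does not describe how repairs were validated.

```python
# Hypothetical illustration only: not taken from QuixBugs or Textbook.

def max_item_buggy(values):
    """Return the largest element of a non-empty list."""
    best = values[0]
    for v in values[1:]:
        if v < best:  # single injected bug: the comparison is inverted
            best = v
    return best


def f1(a):
    # Assumed "obfuscated, undocumented" variant of the same buggy function:
    # identifiers carry no meaning and the docstring is gone.
    b = a[0]
    for c in a[1:]:
        if c < b:
            b = c
    return b


def max_item_fixed(values):
    """Return the largest element of a non-empty list."""
    best = values[0]
    for v in values[1:]:
        if v > best:  # candidate repair: comparison restored
            best = v
    return best


def passes_tests(candidate, cases):
    """Assumed validation step: a repair counts only if every case passes."""
    return all(candidate(list(inputs)) == expected for inputs, expected in cases)


# Small hand-written test cases (hypothetical).
CASES = [([3, 1, 2], 3), ([5], 5), ([-1, -7, 0], 0)]

if __name__ == "__main__":
    print("buggy passes:", passes_tests(max_item_buggy, CASES))
    print("obfuscated buggy passes:", passes_tests(f1, CASES))
    print("fixed passes:", passes_tests(max_item_fixed, CASES))
```

In this sketch the documented variant states the intended behavior in its docstring, while the obfuscated variant leaves only the code itself as evidence of intent, which is one plausible reason repair becomes harder in that setting.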
computer science, information systems, theory & methods