Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

Qi Guo, Junming Cao, Xiaofei Xie, Shangqing Liu, Xiaohong Li, Bihuan Chen, Xin Peng
DOI: https://doi.org/10.48550/arXiv.2309.08221
2023-09-15
Abstract: Code review is an essential activity for ensuring the quality and maintainability of software projects. However, it is a time-consuming and often error-prone task that can significantly impact the development process. Recently, ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks, suggesting its potential to automate code review processes. However, it is still unclear how well ChatGPT performs in code review tasks. To fill this gap, in this paper, we conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks, specifically focusing on automated code refinement based on given code reviews. To conduct the study, we select the existing benchmark CodeReview and construct a new code review dataset with high quality. We use CodeReviewer, a state-of-the-art code review tool, as a baseline for comparison with ChatGPT. Our results show that ChatGPT outperforms CodeReviewer in code refinement tasks. Specifically, our results show that ChatGPT achieves higher EM and BLEU scores of 22.78 and 76.44 respectively, while the state-of-the-art method achieves only 15.50 and 62.88 on a high-quality code review dataset. We further identify the root causes for ChatGPT's underperformance and propose several strategies to mitigate these challenges. Our study provides insights into the potential of ChatGPT in automating the code review process, and highlights the potential research directions.
Software Engineering
What problem does this paper attempt to address?
This paper evaluates the potential of ChatGPT in code review tasks, specifically its ability to perform automated code refinement based on given code reviews. It focuses on the following aspects:

1. **Factors Affecting ChatGPT Performance**: Study how different prompts and temperature settings influence ChatGPT's performance in code refinement tasks.
2. **Comparison between ChatGPT and Existing Methods**: Compare ChatGPT with current state-of-the-art code review tools (such as CodeReviewer) in code refinement tasks.
3. **Advantages and Disadvantages of ChatGPT**: Analyze the situations in which ChatGPT performs well or poorly and explore the reasons behind them.
4. **Improvement Strategies**: Propose methods to alleviate the challenges that ChatGPT encounters in code refinement tasks.

### Research Background

Code review is a crucial activity in the software development process for ensuring code quality and maintainability. However, it is time-consuming and error-prone, and can seriously affect the development process. Recently, ChatGPT, an advanced language model, has performed well in natural language processing tasks, which suggests it could help automate code review. However, its actual performance in code review tasks remains unclear.

### Research Design

To evaluate ChatGPT's ability in code refinement tasks, the authors carried out the following work:

1. **Data Sets**:
   - **CodeReview**: A widely used code review data set containing code review data from the top 10,000 repositories on GitHub.
   - **CodeReview-New**: A new code review data set consisting of two parts:
     - **CodeReview-NewTime**: More recent code review data collected from the same repositories as the CodeReview data set.
     - **CodeReview-NewLanguage**: Code review data collected from repositories using different programming languages.
2. **Experimental Settings**:
   - **Prompt**: Five different prompts were designed, including simple prompts, scenario descriptions, detailed requirements, concise requirements, and their combinations.
   - **Temperature**: Five different temperature values (0, 0.5, 1.0, 1.5, 2.0) were selected to evaluate their influence on ChatGPT's performance.
3. **Evaluation Metrics**:
   - **Exact Match (EM)** and **BLEU**: Traditional evaluation metrics.
   - **EM-trim** and **BLEU-trim**: New variants for more accurate measurement of the generated results.

### Experimental Results

1. **Influence of Different Prompts and Temperatures**:
   - ChatGPT performs best when the temperature is set to 0.
   - As the temperature increases, ChatGPT's performance decreases significantly.
   - Providing detailed scenario descriptions and requirement information can significantly improve ChatGPT's performance.
2. **Comparison with Existing Methods**:
   - ChatGPT performs better than the existing state-of-the-art tool CodeReviewer in code refinement tasks.
   - Specifically, on the new high-quality data set ChatGPT achieves EM and BLEU scores of 22.78 and 76.44, while CodeReviewer achieves 15.50 and 62.88.
3. **Advantages and Disadvantages**:
   - ChatGPT performs well when dealing with simple and clear code modification tasks.
   - ChatGPT performs poorly on documentation and function optimization tasks, mainly because of a lack of domain knowledge and review comments that leave the location or the nature of the change unclear.
4. **Improvement Strategies**:
   - Improve the quality of review comments.
   - Use more advanced large language models (such as GPT-4).

### Conclusion

Through empirical analysis, this study shows the potential of ChatGPT in code refinement tasks and points out its advantages and limitations in specific tasks. Future research directions include...
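
To make the Exact Match (EM) metric listed under Evaluation Metrics more concrete, here is a minimal Python sketch of EM alongside one possible trimmed variant. The summary does not define EM-trim, so the trimming rule is an assumption: it keeps only the part of the generated output that runs from the first line matching the reference's first line to the last line matching the reference's last line. The whitespace normalization and all function names (`normalize`, `exact_match`, `trim_to_reference`, `exact_match_trim`) are likewise illustrative and are not taken from the paper.

```python
from typing import List


def normalize(code: str) -> List[str]:
    """Split code into lines, collapse whitespace runs, and drop blank lines.

    Assumption: whitespace (including indentation) is ignored when comparing.
    """
    lines = (" ".join(line.split()) for line in code.splitlines())
    return [line for line in lines if line]


def exact_match(generated: str, reference: str) -> bool:
    """EM: True only when the normalized outputs are identical line by line."""
    return normalize(generated) == normalize(reference)


def trim_to_reference(generated: str, reference: str) -> List[str]:
    """Hypothetical trimming rule (an assumption, not the paper's definition).

    Keep the generated lines from the first line equal to the reference's
    first line up to the last line equal to the reference's last line.
    If either boundary cannot be found, return the output untrimmed.
    """
    gen, ref = normalize(generated), normalize(reference)
    if not gen or not ref or ref[0] not in gen:
        return gen
    start = gen.index(ref[0])
    # Scan backwards for the last occurrence of the reference's final line.
    for end in range(len(gen) - 1, start - 1, -1):
        if gen[end] == ref[-1]:
            return gen[start:end + 1]
    return gen


def exact_match_trim(generated: str, reference: str) -> bool:
    """Trimmed EM: exact match after discarding surrounding extra text."""
    return trim_to_reference(generated, reference) == normalize(reference)


reference = "def add(a, b):\n    return a + b\n"
generated = (
    "Here is the revised code:\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "This applies the review comment."
)

plain_em = exact_match(generated, reference)
trimmed_em = exact_match_trim(generated, reference)
```

In this example the chat-style text around the code makes plain EM fail, while the trimmed variant still counts the answer as correct. A trimmed metric is aimed at this kind of mismatch.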