Generation-based Code Review Automation: How Far Are We?

Xin Zhou, Kisub Kim, Bowen Xu, DongGyun Han, Junda He, David Lo
DOI: https://doi.org/10.48550/arXiv.2303.07221
2023-03-13
Abstract: Code review is an effective software quality assurance activity; however, it is labor-intensive and time-consuming. Thus, a number of generation-based automatic code review (ACR) approaches have been proposed recently, which leverage deep learning techniques to automate various activities in the code review process (e.g., code revision generation and review comment generation). We find the previous works carry three main limitations. First, the ACR approaches have been shown to be beneficial in each work, but those methods are not comprehensively compared with each other to show their superiority over their peer ACR approaches. Second, general-purpose pre-trained models such as CodeT5 are proven to be effective in a wide range of Software Engineering (SE) tasks. However, no prior work has investigated the effectiveness of these models in ACR tasks yet. Third, prior works heavily rely on the Exact Match (EM) metric which only focuses on the perfect predictions and ignores the positive progress made by incomplete answers. To fill such a research gap, we conduct a comprehensive study by comparing the effectiveness of recent ACR tools as well as the general-purpose pre-trained models. The results show that a general-purpose pre-trained model CodeT5 can outperform other models in most cases. Specifically, CodeT5 outperforms the prior state-of-the-art by 13.4% to 38.9% in two code revision generation tasks. In addition, we introduce a new metric namely Edit Progress (EP) to quantify the partial progress made by ACR tools. The results show that the rankings of models for each task could be changed according to whether EM or EP is being utilized. Lastly, we derive several insightful lessons from the experimental results and reveal future research directions for generation-based code review automation.
Software Engineering
What problem does this paper attempt to address?
The paper addresses three limitations of prior work:

1. **Lack of comprehensive comparison**: Existing automatic code review (ACR) methods have each been shown to be beneficial in their own studies, but they have not been compared against one another, which makes it difficult for practitioners to judge which method best suits a specific task.
2. **Unassessed general-purpose pre-trained models**: General-purpose pre-trained models (such as CodeT5) perform well on a wide range of software engineering tasks, but their performance on ACR tasks has not yet been studied.
3. **Overly strict evaluation metric**: Existing ACR work relies mainly on Exact Match (EM), which counts only completely correct predictions and ignores the partial progress made by incomplete answers.

To fill these gaps, the paper conducts large-scale experiments that compare recent ACR tools and general-purpose pre-trained models on a unified benchmark. Specifically, the goals of the paper are:

- **Evaluate existing ACR tools and pre-trained models**: Compare the performance of different models on three tasks: Code Revision Before Review (CRB), Code Revision After Review (CRA), and Review Comment Generation (RCG).
- **Introduce a new evaluation metric**: Propose Edit Progress (EP), which quantifies how much closer the generated code is to the correct code than the initially submitted code was.
- **Reveal future research directions**: Summarize lessons from the experimental results and point out future research directions for generation-based code review automation.

Through these efforts, the paper aims to provide a more comprehensive understanding of generation-based code review automation and to guide future research.
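To make the contrast between EM and EP concrete, the following is a minimal sketch, not the authors' implementation. It assumes that EP is the relative reduction in token-level edit distance to the correct code, that edit distance is plain Levenshtein distance, and that code is tokenized by whitespace; the function names `edit_distance`, `exact_match`, and `edit_progress` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (assumed distance)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def exact_match(prediction: str, reference: str) -> bool:
    """EM: credit only when every token matches the reference."""
    return prediction.split() == reference.split()


def edit_progress(initial: str, prediction: str, reference: str) -> float:
    """EP under the stated assumptions: share of the initial edit distance
    to the reference that the prediction removes. Negative values mean the
    prediction is farther from the reference than the initial code was."""
    ref_tokens = reference.split()
    baseline = edit_distance(initial.split(), ref_tokens)
    if baseline == 0:
        raise ValueError("initial code already equals the reference")
    remaining = edit_distance(prediction.split(), ref_tokens)
    return (baseline - remaining) / baseline


# Hypothetical example: the prediction fixes three of the four needed edits.
initial = "int x = 0 ; return x ;"
reference = "long y = 1 ; return y ;"
prediction = "long y = 0 ; return y ;"

print(exact_match(prediction, reference))             # False
print(edit_progress(initial, prediction, reference))  # 0.75
```

In this example EM gives the prediction no credit, while EP records that it closes three quarters of the gap to the correct code. That difference is the kind of partial progress the summary says EM ignores.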