Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study

Doriane Olewicki, Leuson Da Silva, Suhaib Mujahid, Arezou Amini, Benjamin Mah, Marco Castelluccio, Sarra Habchi, Foutse Khomh, Bram Adams
2024-11-12
Abstract: We conduct a large-scale empirical user study in a live setup to evaluate the acceptance of LLM-generated comments and their impact on the review process. This user study was performed in two organizations, Mozilla (which has its codebase available as open source) and Ubisoft (fully closed-source). Inside their usual review environment, participants were given access to RevMate, an LLM-based assistive tool suggesting generated review comments using an off-the-shelf LLM with Retrieval Augmented Generation to provide extra code and review context, combined with LLM-as-a-Judge, to auto-evaluate the generated comments and discard irrelevant cases. Based on more than 587 patch reviews provided by RevMate, we observed that 8.1% and 7.2%, respectively, of LLM-generated comments were accepted by reviewers in each organization, while 14.6% and 20.5% other comments were still marked as valuable as review or development tips. Refactoring-related comments are more likely to be accepted than Functional comments (18.2% and 18.6% compared to 4.8% and 5.2%). The extra time spent by reviewers to inspect generated comments or edit accepted ones (36/119), yielding an overall median of 43s per patch, is reasonable. The accepted generated comments are as likely to yield future revisions of the revised patch as human-written comments (74% vs 73% at chunk-level).
Software Engineering
What problem does this paper attempt to address?
This paper evaluates how readily reviewers accept code review comments generated by large language models (LLMs) in a real development workflow, and how those comments affect the review process. Specifically, the researchers address the following questions:

1. **How often do reviewers accept comments generated by LLM-based methods?**
   - At Mozilla and Ubisoft, 8.1% and 7.2% of the generated comments were accepted by reviewers, respectively, and a further 14.6% and 20.5% were still marked as valuable review or development tips.
2. **What is the relationship between comment categories and acceptance rates?**
   - Refactoring-related comments are more likely to be accepted than functional comments (18.2% and 18.6% at Mozilla and Ubisoft, respectively, compared to 4.8% and 5.2%).
3. **What is the impact of using LLM-generated comments on the code review workflow?**
   - Reviewers spend extra time inspecting generated comments and editing accepted ones, but the overall median of 43 seconds per patch is reasonable. Of the 119 accepted comments, 36 were edited, most of them only shortened.
4. **What is the impact of using LLM-generated comments on the patch review process?**
   - Accepted generated comments are as likely to lead to later revisions of the patch as human-written comments (74% vs. 73% at chunk level). Generated comments also trigger fewer follow-up developer comments (23% compared to 34%).

To answer these questions, the researchers conducted a large-scale, six-week user study in two different types of organizations: Mozilla, whose codebase is open source, and Ubisoft, which is fully closed-source. They developed an LLM-based assistive tool named RevMate, which can be easily integrated into modern review environments. RevMate uses retrieval-augmented generation (RAG) to supply extra code and review context when generating comments, and LLM-as-a-Judge to evaluate the generated comments and discard irrelevant ones. In this way, the researchers not only evaluated the practical effect of LLM-generated comments but also compared how those comments perform in the two environments, providing insights for future code review automation.
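The retrieve, generate, then judge flow described above can be sketched roughly as follows. This is a minimal illustration, not RevMate's implementation: the function names, the word-overlap retrieval, the `fake_llm` and `fake_judge` stand-ins, and the 0.5 threshold are all assumptions made for the example. A real system would call an LLM for both generation and judging.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredComment:
    text: str
    score: float  # judge's relevance score, assumed to lie in [0, 1]


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization (a deliberately crude assumption)."""
    return set(text.lower().split())


def retrieve_context(patch: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retrieval step: rank snippets by word overlap with the patch."""
    patch_tokens = tokenize(patch)
    ranked = sorted(
        corpus,
        key=lambda snippet: len(patch_tokens & tokenize(snippet)),
        reverse=True,
    )
    return ranked[:k]


def suggest_comments(
    patch: str,
    corpus: List[str],
    generator: Callable[[str], List[str]],
    judge: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[ScoredComment]:
    """Retrieve context, generate candidates, keep those the judge rates highly."""
    context = retrieve_context(patch, corpus)
    prompt = "Context:\n" + "\n".join(context) + "\n\nPatch:\n" + patch
    candidates = generator(prompt)
    scored = [ScoredComment(text, judge(text)) for text in candidates]
    return [c for c in scored if c.score >= threshold]


# Hypothetical stand-ins for the generator and judge models.
def fake_llm(prompt: str) -> List[str]:
    return ["Consider extracting this loop into a helper function.", "Nice patch!"]


def fake_judge(comment: str) -> float:
    return 0.9 if "helper" in comment else 0.1


corpus = [
    "def parse(x): return int(x)",
    "README intro text",
    "for item in items: total += item",
]
patch = "for item in items: print(item)"

kept = suggest_comments(patch, corpus, fake_llm, fake_judge)
for comment in kept:
    print(f"{comment.score:.1f}  {comment.text}")
```

Passing the generator and judge as plain callables keeps the sketch independent of any particular model or API. Only the filtering logic is concrete here.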