Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study

Doriane Olewicki, Leuson Da Silva, Suhaib Mujahid, Arezou Amini, Benjamin Mah, Marco Castelluccio, Sarra Habchi, Foutse Khomh, Bram Adams

2024-11-12

Abstract: We conduct a large-scale empirical user study in a live setup to evaluate the acceptance of LLM-generated comments and their impact on the review process. This user study was performed in two organizations, Mozilla (which has its codebase available as open source) and Ubisoft (fully closed-source). Inside their usual review environment, participants were given access to RevMate, an LLM-based assistive tool suggesting generated review comments using an off-the-shelf LLM with Retrieval Augmented Generation to provide extra code and review context, combined with LLM-as-a-Judge, to auto-evaluate the generated comments and discard irrelevant cases. Based on more than 587 patch reviews provided by RevMate, we observed that 8.1% and 7.2%, respectively, of LLM-generated comments were accepted by reviewers in each organization, while 14.6% and 20.5% other comments were still marked as valuable as review or development tips. Refactoring-related comments are more likely to be accepted than Functional comments (18.2% and 18.6% compared to 4.8% and 5.2%). The extra time spent by reviewers to inspect generated comments or edit accepted ones (36/119), yielding an overall median of 43s per patch, is reasonable. The accepted generated comments are as likely to yield future revisions of the revised patch as human-written comments (74% vs 73% at chunk-level).
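As a reading aid, the figures quoted in the abstract can be combined with simple arithmetic. The sketch below is not taken from the paper: the variable names are illustrative, and it assumes that the "other comments" marked as valuable do not overlap with the accepted ones, so the two shares can be added.

```python
# Percentages as reported in the abstract, per organization.
accepted = {"Mozilla": 8.1, "Ubisoft": 7.2}
valuable = {"Mozilla": 14.6, "Ubisoft": 20.5}

# Assumption: accepted and valuable comments are disjoint groups,
# so their shares can simply be added.
useful = {org: round(accepted[org] + valuable[org], 1) for org in accepted}

# Share of accepted comments that reviewers edited (36 of 119).
edited_share = round(100 * 36 / 119, 1)

print(useful)        # combined share of accepted or valuable comments
print(edited_share)  # percentage of accepted comments that were edited
```

Under that assumption, roughly 22.7% (Mozilla) and 27.7% (Ubisoft) of generated comments were either accepted or marked as valuable, and about 30.3% of the accepted comments were edited.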

Software Engineering

What problem does this paper attempt to address?

This paper evaluates how well code review comments generated by large language models (LLMs) are accepted in a real development workflow, and how they affect the review process. Specifically, the study addresses four questions:

1. **How often do reviewers accept comments generated by LLM-based methods?** At Mozilla and Ubisoft, 8.1% and 7.2% of the generated comments were accepted by reviewers, respectively, and a further 14.6% and 20.5% were still marked as valuable review or development tips.
2. **What is the relationship between comment categories and acceptance rates?** Refactoring-related comments are more likely to be accepted than functional comments (18.2% and 18.6% at Mozilla and Ubisoft, respectively, versus 4.8% and 5.2%).
3. **What is the impact of using LLM-generated comments on the code review workflow?** Reviewers spend extra time inspecting generated comments and editing accepted ones, but the overall median of 43 seconds per patch is considered reasonable. Of the 119 accepted comments, 36 were edited, and most of those edits only shortened the comment.
4. **What is the impact of using LLM-generated comments on the patch review process?** Accepted generated comments are as likely to lead to a later revision of the patch as human-written comments (74% vs. 73% at chunk level). Generated comments also trigger fewer follow-up developer comments (23% compared to 34%).

To answer these questions, the researchers ran a large-scale, six-week user study in two different types of organizations: Mozilla, whose codebase is open source, and Ubisoft, which is fully closed-source. They developed an LLM-based assistive tool named RevMate, which can be easily integrated into modern review environments. RevMate uses retrieval-augmented generation (RAG) to supply extra code and review context when generating comments, and LLM-as-a-Judge to evaluate the generated comments and discard irrelevant ones. In this way, the researchers not only measured the practical effect of LLM-generated comments but also compared how these comments perform in the two environments, providing insights for future code review automation.
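The retrieve, generate, and judge flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not RevMate's actual implementation: every name (`retrieve_context`, `review_patch`, `fake_generate`, `fake_judge`, the `threshold` value) is hypothetical, retrieval is a toy word-overlap ranking, and the two LLM calls are replaced by stub functions so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Comment:
    """A candidate review comment (hypothetical structure)."""
    text: str


def retrieve_context(patch: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retrieval step: rank snippets by word overlap with the patch."""
    patch_tokens = set(patch.split())
    ranked = sorted(
        corpus,
        key=lambda snippet: len(patch_tokens & set(snippet.split())),
        reverse=True,
    )
    return ranked[:k]


def review_patch(
    patch: str,
    corpus: List[str],
    generate: Callable[[str, List[str]], List[Comment]],
    judge: Callable[[str, Comment], float],
    threshold: float = 0.5,
) -> List[Comment]:
    """Generate comments with retrieved context, then keep those the judge rates highly."""
    context = retrieve_context(patch, corpus)
    candidates = generate(patch, context)
    return [c for c in candidates if judge(patch, c) >= threshold]


# Stubs standing in for the generator LLM and the judge LLM (assumptions).
def fake_generate(patch: str, context: List[str]) -> List[Comment]:
    return [
        Comment("Consider handling ValueError from json.loads."),
        Comment("Looks fine."),
    ]


def fake_judge(patch: str, comment: Comment) -> float:
    # Pretend the judge scores specific, actionable comments higher.
    return 0.9 if "ValueError" in comment.text else 0.1


PATCH = "def parse(data): return json.loads(data)"
CORPUS = [
    "def parse(data): validate input first",
    "unrelated logging helper",
    "json.loads(data) may raise ValueError",
]

kept = review_patch(PATCH, CORPUS, fake_generate, fake_judge)
for comment in kept:
    print(comment.text)
```

Passing the generator and judge in as callables keeps the filtering logic independent of any particular model or API, which is a design choice of this sketch rather than something the source specifies.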

Behind Linus's law: A preliminary analysis of open source software peer review practices in Mozilla and Python

Automating Patch Set Generation from Code Review Comments Using Large Language Models

Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution

AUGER: Automatically Generating Review Comments with Pre-training Models

MARG: Multi-Agent Review Generation for Scientific Papers

Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

Self-Improving Customer Review Response Generation Based on LLMs

AI-powered Code Review with LLMs: Early Results

AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews

LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing

An Empirical Study on Code Review Activity Prediction and Its Impact in Practice

Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?

LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review

Reviewer2: Optimizing Review Generation Through Prompt Generation

Software Vulnerability and Functionality Assessment using LLMs

Can LLMs Patch Security Issues?

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

Benchmarking LLMs' Judgments with No Gold Standard