Abstract: The automation of code review has been tackled by several researchers with the goal of reducing its cost. The adoption of deep learning in software engineering pushed the automation to new boundaries, with techniques imitating developers in generative tasks, such as commenting on a code change as a reviewer would do or addressing a reviewer's comment by modifying code. The performance of these techniques is usually assessed through quantitative metrics, e.g., the percentage of instances in the test set for which correct predictions are generated, leaving many open questions about the techniques' capabilities. For example, knowing that an approach is able to correctly address a reviewer's comment in 10% of cases is of little value without knowing what was asked by the reviewer: what if, in all successful cases, the code change required to address the comment was just the removal of an empty line? In this paper we aim to characterize the cases in which three code review automation techniques tend to succeed or fail in the two above-described tasks. The study has a strong qualitative focus, with ∼105 man-hours invested in manually inspecting correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis is two taxonomies reporting, for each of the two tasks, the types of code changes on which the experimented techniques tend to succeed or to fail, pointing to areas for future work. Our manual analysis also identified several issues in the datasets used to train and test the experimented techniques. Finally, we assess the importance of research on techniques specialized for code review automation by comparing their performance with ChatGPT, a general-purpose large language model, finding that ChatGPT struggles to comment on code as a human reviewer would.
