Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

Zhi Chen, Lingxiao Jiang
2024-10-16
Abstract: In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and interacting with external environments, offer promising solutions to complex software engineering tasks. However, while much research has evaluated code generated by large language models (LLMs), comprehensive studies on agent-generated patches, particularly in real-world settings, are lacking. This study addresses that gap by evaluating 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues from SWE-Bench Verified, focusing on their impact on code quality. Our analysis shows no single agent dominated, with 170 issues unresolved, indicating room for improvement. Even for patches that passed unit tests and resolved issues, agents made different file and function modifications compared to the gold patches from repository developers, revealing limitations in the benchmark's test case coverage. Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities; while some agents increased code complexity, many reduced code duplication and minimized code smells. Finally, agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve effectiveness. This study provides the first comprehensive evaluation of agent-generated patches on real-world GitHub issues, offering insights to advance AI-driven software development.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates the quality of patches generated by Software Development Agents for real-world GitHub issues. It aims to fill a gap in existing research: the lack of a comprehensive evaluation of agent-generated patches in real-world settings. Specifically, the paper addresses the following:

1. **Patch Pattern Analysis**:
   - Examine the patch patterns that current software development agents use when resolving real-world GitHub issues.
   - Compare agent-generated patches with the gold patches written by repository developers, and explore whether agents modify the same files and functions or take different approaches.
2. **Code Quality Impact**:
   - Evaluate the impact of agent-generated patches on the reliability, security, and maintainability of the codebase.
   - Analyze whether these patches introduce or remove code smells, vulnerabilities, and bugs, and whether they increase code complexity or duplication.
3. **Differences between Resolved and Unresolved Issues**:
   - Compare resolved and unresolved GitHub issues, and identify how factors such as problem statement complexity, codebase size, and solution effort affect agent performance.
   - Offer suggestions for improving agent performance on more complex real-world tasks.

Through these evaluations, the paper aims to provide insights for AI-driven software development and to reveal limitations in the SWE-Bench benchmark, such as unit tests that do not fully cover all modified code. The results are intended to support the advancement and application of software development agents.

### Main Contributions

- **First Comprehensive Evaluation**: This is the first comprehensive evaluation of the quality of patches generated by software development agents for real-world GitHub issues.
- **Reliability and Security Analysis**: Compare agent-generated patches with human-written patches in terms of reliability, security, and maintainability.
- **Benchmark Limitations**: Point out a limitation of the SWE-Bench benchmark: its unit tests do not fully cover all modified code.
- **Improvement Suggestions**: By comparing resolved and unresolved issues, propose ways to improve agent performance on more complex real-world tasks.
- **Data Sharing**: Publicly share datasets and scripts to promote further research and ensure the reproducibility of experiments.
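
The file-level comparison described under Patch Pattern Analysis (does an agent touch the same files as the gold patch?) can be illustrated with a minimal sketch. This is not the paper's tooling: the helper names `modified_files` and `file_overlap`, the choice of Jaccard overlap as the measure, and the two sample patches are all assumptions made here for illustration, and the input is assumed to be a git-style unified diff.

```python
def modified_files(patch_text: str) -> set[str]:
    """Collect file paths touched by a git-style unified diff (hypothetical helper)."""
    files = set()
    in_header = False  # True between a "diff --git" line and the first hunk
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            in_header = True
        elif line.startswith("@@"):
            in_header = False  # hunk body: "--- " / "+++ " here are content, not headers
        elif in_header and line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t")[0].strip()
            if path != "/dev/null":  # skip the placeholder for added/deleted files
                files.add(path[2:] if path.startswith(("a/", "b/")) else path)
    return files


def file_overlap(agent_patch: str, gold_patch: str) -> float:
    """Jaccard overlap of modified files (an assumed metric, not the paper's)."""
    agent, gold = modified_files(agent_patch), modified_files(gold_patch)
    union = agent | gold
    return len(agent & gold) / len(union) if union else 1.0


# Made-up sample patches, for illustration only.
GOLD_PATCH = """diff --git a/pkg/core.py b/pkg/core.py
--- a/pkg/core.py
+++ b/pkg/core.py
@@ -1,2 +1,2 @@
-x = 1
+x = 2
"""

AGENT_PATCH = """diff --git a/pkg/core.py b/pkg/core.py
--- a/pkg/core.py
+++ b/pkg/core.py
@@ -1,2 +1,2 @@
--- old banner
+x = 2
diff --git a/pkg/utils.py b/pkg/utils.py
--- a/pkg/utils.py
+++ b/pkg/utils.py
@@ -5,1 +5,1 @@
-y = 0
+y = 1
"""

if __name__ == "__main__":
    print(sorted(modified_files(AGENT_PATCH)))
    print(file_overlap(AGENT_PATCH, GOLD_PATCH))
```

An overlap below 1.0 for a patch that still passes the tests is the kind of divergence the summary describes: the agent resolved the issue while editing a different set of files than the repository developers did. A function-level comparison would need a language-aware parser and is outside this sketch.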