Abstract: In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and interacting with external environments, offer promising solutions to complex software engineering tasks. However, while much research has evaluated code generated by large language models (LLMs), comprehensive studies on agent-generated patches, particularly in real-world settings, are lacking. This study addresses that gap by evaluating 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues from SWE-Bench Verified, focusing on their impact on code quality. Our analysis shows no single agent dominated, with 170 issues unresolved, indicating room for improvement. Even for patches that passed unit tests and resolved issues, agents made different file and function modifications compared to the gold patches from repository developers, revealing limitations in the benchmark's test case coverage. Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities; while some agents increased code complexity, many reduced code duplication and minimized code smells. Finally, agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve effectiveness. This study provides the first comprehensive evaluation of agent-generated patches on real-world GitHub issues, offering insights to advance AI-driven software development.

What problem does this paper attempt to address?

This paper evaluates the quality of patches generated by Software Development Agents on real-world GitHub issues. It aims to fill a gap in existing research: the lack of a comprehensive evaluation of agent-generated patches in real-world settings. Specifically, the paper addresses the following questions:

1. **Patch Pattern Analysis**:
   - Study the patch patterns that current software development agents use when resolving real-world GitHub issues.
   - Compare agent-generated patches with the gold patches written by the repository developers, and examine whether agents modify the same files and functions or take a different approach.
2. **Code Quality Impact**:
   - Evaluate the impact of agent-generated patches on the reliability, security, and maintainability of the codebase.
   - Analyze whether these patches introduce or resolve code smells, vulnerabilities, and bugs, and whether they increase code complexity or duplication.
3. **Differences between Resolved and Unresolved Issues**:
   - Compare resolved and unresolved GitHub issues, and identify how factors such as problem statement complexity, codebase size, and solution effort affect agent performance.
   - Offer suggestions for improving agent performance on more complex real-world tasks.

Through these evaluations, the paper aims to provide insights for AI-driven software development and to expose limitations in the SWE-Bench benchmark, such as unit tests that do not fully cover all modified code. The results are intended to support the advancement and application of software development agents.

### Main Contributions

- **First Comprehensive Evaluation**: The first comprehensive evaluation of the quality of patches generated by software development agents on real-world GitHub issues.
- **Reliability and Security Analysis**: A comparison of agent-generated and human-written patches in terms of reliability, security, and maintainability.
- **Benchmark Limitations**: Evidence that the SWE-Bench unit tests do not fully cover all modified code.
- **Improvement Suggestions**: Suggestions, drawn from comparing resolved and unresolved issues, for improving agent performance on more complex real-world tasks.
- **Data Sharing**: Publicly shared datasets and scripts to support further research and reproducibility.
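
The file-level comparison described under Patch Pattern Analysis can be illustrated with a small sketch. This is not the paper's tooling: the function names `modified_files` and `file_overlap` are hypothetical, and the use of Jaccard similarity as the overlap measure is an assumption made here for illustration. It assumes both patches are unified diffs in the usual `git diff` format.

```python
import re

# Matches the "+++ b/path" header that names the post-change file in a
# unified diff. Limitation: an added line whose own content begins with
# "++ " would also match; a real tool should use a proper diff parser.
_NEW_FILE_HEADER = re.compile(r"^\+\+\+ (?:b/)?(\S+)", re.MULTILINE)


def modified_files(patch_text: str) -> set[str]:
    """Return the set of file paths touched by a unified diff."""
    return {
        path
        for path in _NEW_FILE_HEADER.findall(patch_text)
        if path != "/dev/null"  # deleted files have no post-change path
    }


def file_overlap(agent_patch: str, gold_patch: str) -> float:
    """Jaccard similarity between the file sets of two patches.

    Hypothetical metric chosen for this sketch: 1.0 means both patches
    touch exactly the same files, 0.0 means they share none.
    """
    agent_files = modified_files(agent_patch)
    gold_files = modified_files(gold_patch)
    union = agent_files | gold_files
    if not union:
        return 1.0  # two empty patches trivially agree
    return len(agent_files & gold_files) / len(union)


# Toy patches: the agent edits two files, the gold patch edits one of them.
agent_patch = (
    "--- a/pkg/core.py\n+++ b/pkg/core.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
    "--- a/pkg/util.py\n+++ b/pkg/util.py\n@@ -1 +1 @@\n-y = 1\n+y = 2\n"
)
gold_patch = (
    "--- a/pkg/core.py\n+++ b/pkg/core.py\n@@ -1 +1 @@\n-x = 1\n+x = 3\n"
)

overlap = file_overlap(agent_patch, gold_patch)
```

A low overlap on a patch that still passes the benchmark's tests would be one signal of the coverage limitation the paper describes, since the tests accepted changes in different locations than the developers' fix.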

Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios
