Abstract: As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods by achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score, and we provide insights into the advantages of incorporating LLMs for error detection.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the problem that coding agents often introduce runtime errors when generating or modifying code. Although large language models (LLMs) and LLM-based coding agents perform well on complex tasks, the code they generate often contains runtime errors (such as `SyntaxError`, `AttributeError` and `TypeError`), which can cause the code to fail and are difficult for static analysis tools to detect. Specifically, the paper focuses on the following key issues:

1. **Detection of runtime errors**:
   - Code generated by coding agents may contain runtime errors, which affect its correctness and reliability.
   - Static analysis tools are limited in detecting certain types of runtime errors (such as `TypeError`), because these errors usually depend on specific input conditions that static analysis cannot reason about.

2. **Error detection without execution**:
   - Although dynamic analysis methods can detect runtime errors, they need to execute the code, which may raise security, privacy and technical challenges. Moreover, dynamic analysis can only detect one error at a time, which increases the detection cost.
   - The paper proposes a method for detecting runtime errors without executing the code, called **REDO** (Execution-free Runtime Error Detection for COding Agents).

3. **Evaluation and benchmarking**:
   - To evaluate runtime error detection for coding agents, the paper proposes a new benchmark task, **SWE-Bench-Error-Detection (SWEDE)**, which is built on SWE-Bench (lite) and involves complex external dependencies.
   - Through quantitative and qualitative analysis, the paper shows that REDO performs better across various error detection tasks. Compared with existing methods, REDO improves accuracy by 11.0% and weighted F1 score by 9.1%.
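To make the first issue concrete, here is a minimal sketch of a runtime error that depends on the input. The function and its inputs are illustrative and do not come from the paper. Because the function has no type annotations, a static checker may well accept it, yet it fails only when a particular input arrives at run time.

```python
def total_length(items):
    """Sum the lengths of all entries in `items`."""
    return sum(len(item) for item in items)


def try_call(items):
    """Return the result, or the name of the exception that was raised."""
    try:
        return total_length(items)
    except TypeError as exc:
        return type(exc).__name__


ok = try_call(["ab", "cde"])  # every entry has a length
bad = try_call(["ab", 3])     # an int has no len(), so a TypeError is raised
```

Whether the failure occurs depends entirely on what the caller passes in, which is the kind of input context the paper suggests an LLM can help anticipate.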
### Overview of the solution

The paper proposes **REDO**, a method that combines static analysis tools with large language models (LLMs). REDO has two main stages:

1. **Differential analysis**:
   - Use a static analysis tool (such as Pyright) to analyze the original code and the modified code separately, then compare the two results to detect newly introduced runtime errors.
   - This reduces the false positives produced by static analysis tools and keeps the focus on errors introduced by the modification.

2. **LLM-based detection**:
   - For potential errors that static analysis tools fail to detect, use an LLM for further reasoning. The LLM can understand the problem statement and the modification patch, predict possible input contexts, and identify runtime errors missed by static analysis.

In this way, REDO balances the reliability of static analysis tools against the broader detection capabilities of LLMs, and so detects runtime errors more effectively.

### Summary

The core objective of this paper is to improve the quality of code generated by coding agents by detecting runtime errors without executing the code, thereby reducing the risk of code failure and improving development efficiency.
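As a rough illustration of the two stages described in the overview, the sketch below compares static-analysis findings before and after a patch and falls back to an LLM verdict only when no new finding appears. It is not the authors' implementation. The `Diagnostic` shape, the rule labels, and the `llm_check` callable are all hypothetical stand-ins, and no real Pyright or LLM call is made.

```python
from collections import Counter
from typing import Callable, NamedTuple


class Diagnostic(NamedTuple):
    """One static-analysis finding (hypothetical shape, not Pyright's schema)."""
    file: str
    rule: str
    message: str


def new_diagnostics(before, after):
    """Return findings present after the patch but not before it.

    Line numbers are left out of Diagnostic on purpose: a patch shifts
    them, which would make pre-existing findings look new.
    """
    remaining = Counter(after) - Counter(before)
    return list(remaining.elements())


def detect(before, after, llm_check: Callable[[], bool]) -> bool:
    """Return True if the patch is predicted to cause a runtime error."""
    if new_diagnostics(before, after):
        return True      # stage 1: the checker reports a newly introduced error
    return llm_check()   # stage 2: defer to the (hypothetical) LLM verdict


# Illustrative findings; the rule labels are made up for this example.
before = [Diagnostic("a.py", "optional-access", "x may be None")]
after = before + [Diagnostic("a.py", "attribute-error", "no attribute 'foo'")]

flagged = detect(before, after, llm_check=lambda: False)
unchanged = detect(before, before, llm_check=lambda: False)
```

Comparing findings as a multiset, rather than judging the patched code alone, is what lets pre-existing warnings drop out, so only errors attributable to the patch trigger the first stage.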

REDO: Execution-Free Runtime Error Detection for COding Agents

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Reasoning Runtime Behavior of a Program with LLM: How Far Are We?

RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance

Frustrated with Code Quality Issues? LLMs can Help!

Co-Learning: Code Learning for Multi-Agent Reinforcement Collaborative Framework with Conversational Natural Language Interfaces

Self-Edit: Fault-Aware Code Editor for Code Generation

Agentless: Demystifying LLM-based Software Engineering Agents

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

MarsCode Agent: AI-native Automated Bug Fixing

ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation

LLMs as Continuous Learners: Improving the Reproduction of Defective Code in Software Issues

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

A Unified Debugging Approach via LLM-Based Multi-Agent Synergy

DRC-Coder: Automated DRC Checker Code Generation Using LLM Autonomous Agent

An Empirical Study on LLM-based Agents for Automated Bug Fixing

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

AI-powered Code Review with LLMs: Early Results