Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B Esler
Abstract: As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level problems with complex external dependencies. Finally, through both quantitative and qualitative analyses across various error detection tasks, we demonstrate that REDO outperforms current state-of-the-art methods by achieving an 11.0% higher accuracy and a 9.1% higher weighted F1 score, and we provide insights into the advantages of incorporating LLMs for error detection.
### What problem does this paper attempt to solve?
This paper addresses the problem that coding agents often introduce runtime errors when generating or modifying code. Although large language models (LLMs) and LLM-based coding agents perform well on complex tasks, the code they produce often contains runtime errors (such as `SyntaxError`, `AttributeError`, and `TypeError`), which cause the code to fail and are difficult for static analysis tools to detect.
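To illustrate why such errors can slip past a purely static check, here is a minimal sketch. The functions `total_length` and `try_call` and their inputs are hypothetical and are not taken from the paper. Whether the call fails depends entirely on the values passed in at runtime.

```python
def total_length(items):
    """Sum the lengths of all elements in `items` (hypothetical example)."""
    # Nothing in this unannotated function reveals what `items` will contain,
    # so a checker that only reads the source has little to flag here.
    return sum(len(item) for item in items)


def try_call(items):
    """Run total_length and report a TypeError instead of crashing."""
    try:
        return total_length(items)
    except TypeError as exc:
        return f"TypeError: {exc}"


ok_result = try_call(["ab", "c"])   # every element supports len()
bad_result = try_call(["ab", 5])    # the integer 5 does not support len()
```

The same function succeeds on one input and raises `TypeError` on another, which is the input-dependent behaviour the summary describes.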
Specifically, the paper focuses on the following key issues:
1. **Detection of runtime errors**:
- There may be runtime errors in the code generated by coding agents, which can affect the correctness and reliability of the code.
- Static analysis tools have limitations in detecting certain types of runtime errors (such as `TypeError`), because these errors usually depend on specific input conditions, and static analysis cannot reason about these input conditions.
2. **Error detection without execution**:
- Dynamic analysis methods can detect runtime errors, but they must execute the code, which raises security, privacy, and technical challenges. Moreover, dynamic analysis can detect only one error at a time, which increases the cost of detection.
- The paper proposes a method for detecting runtime errors without executing the code, called **REDO** (Execution-free Runtime Error Detection for COding Agents).
3. **Evaluation and benchmarking**:
- To evaluate runtime error detection on repository-level problems, the paper proposes a new benchmark task, **SWE-Bench-Error-Detection (SWEDE)**, which is built on SWE-Bench (lite) and involves complex external dependencies.
- Through quantitative and qualitative analyses across various error detection tasks, the paper shows that REDO outperforms existing state-of-the-art methods, with 11.0% higher accuracy and a 9.1% higher weighted F1 score.
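For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and a weighted F1 score in plain Python. It assumes the common support-weighted definition, in which the per-class F1 scores are averaged with weights proportional to each class's frequency in the ground truth. The summary does not spell out the paper's exact formula, and the labels `"error"` and `"safe"` are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Hypothetical labels: does a patch introduce a runtime error or not?
y_true = ["error", "error", "safe", "safe", "safe"]
y_pred = ["error", "safe", "safe", "safe", "error"]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a rare class from dominating the average, which matters when patches with and without runtime errors are unevenly represented.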
### Overview of the solution
The paper proposes **REDO**, a method that combines static analysis tools with large language models (LLMs). REDO consists of two main stages:
1. **Differential analysis**:
- Use static analysis tools (such as Pyright) to analyze the original code and the modified code separately, then compare the two sets of reports to isolate runtime errors newly introduced by the modification.
- This method can reduce the false positives generated by static analysis tools and focus on new errors introduced by modifications.
2. **LLM-based detection**:
- For potential errors that static analysis tools fail to detect, use an LLM for further reasoning. The LLM can read the problem statement and the modification patch, anticipate possible input contexts, and identify runtime errors that static analysis missed.
In this way, REDO balances the reliability of static analysis tools against the broader detection coverage of LLMs, and so detects runtime errors more effectively.
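The two stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The `Diagnostic` record, the rule labels, and the `llm_judge` callable are all hypothetical stand-ins, and the diagnostics are assumed to have already been collected by running a static analyzer on both versions of the code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Diagnostic:
    """One static-analysis finding (hypothetical record).

    Line numbers are left out on purpose: a patch shifts them, which would
    make unchanged findings look new when the two reports are compared.
    """
    file: str
    rule: str
    message: str


def new_diagnostics(before: Sequence[Diagnostic],
                    after: Sequence[Diagnostic]) -> List[Diagnostic]:
    """Stage 1: keep only findings absent from the original code's report."""
    seen = set(before)
    return [d for d in after if d not in seen]


def detect_runtime_error(before: Sequence[Diagnostic],
                         after: Sequence[Diagnostic],
                         llm_judge: Callable[[str, str], bool],
                         issue: str,
                         patch: str) -> Tuple[bool, List[Diagnostic]]:
    """Flag a patch if static analysis finds new errors; otherwise ask the LLM."""
    introduced = new_diagnostics(before, after)
    if introduced:
        return True, introduced
    # Stage 2: `llm_judge` is an assumed callable wrapping an LLM prompt that
    # receives the problem statement and the patch and returns True if it
    # predicts a runtime error.
    return llm_judge(issue, patch), []


# Made-up findings for illustration only.
old = [Diagnostic("a.py", "attribute-access", "unknown attribute 'foo'")]
new = old + [Diagnostic("a.py", "call-issue", "expected 1 positional argument")]

flagged, found = detect_runtime_error(old, new, lambda i, p: False, "issue", "patch")
clean, none_found = detect_runtime_error(old, old, lambda i, p: False, "issue", "patch")
```

Comparing the two reports means pre-existing warnings in the repository are ignored, so only findings attributable to the patch can trigger a flag before the LLM is consulted.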
### Summary
The core objective of this paper is to improve the quality of code generated by coding agents, especially by developing a method for detecting runtime errors without executing the code, thereby reducing the risk of code failure and improving development efficiency.