Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang
Abstract: Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often rely on rigid processes and tend to repeat ineffective actions without the capacity to evaluate their performance or adapt their strategies over time. To address these challenges, we propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance on repository-level software tasks. SWE-Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self-feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased search depth and identifies key factors that facilitate effective self-evaluation in software agents. This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.
What problem does this paper attempt to address?
This paper addresses the limitations of current large language model (LLM) agents in software engineering tasks. Existing LLM-based software agents often follow rigid processes when handling complex, long-horizon tasks, tend to get stuck repeating ineffective actions, and cannot evaluate their own performance or adjust their strategies over time. These issues limit their effectiveness in dynamic, complex software development environments.
To address these challenges, the authors propose the SWE-Search framework. SWE-Search is a multi-agent system designed to improve the performance of software agents on repository-level tasks in the following ways:
1. **Flexible Exploration and Adaptation**: The SWE-Agent can switch between actions such as planning, searching, and editing in a dynamic environment, adapting to changing information and requirements.
2. **Iterative Learning through Feedback**: By combining a Monte Carlo Tree Search (MCTS) planning module with the Value Agent, SWE-Search balances exploration and exploitation and iteratively improves its decisions using both quantitative and qualitative feedback.
3. **Collaborative Decision-Making**: The Discriminator Agent facilitates multi-agent debate, so that the final decision is evaluated and discussed before it is accepted, simulating the collaborative decision-making of a real-world engineering team.
Through these mechanisms, SWE-Search aims to replicate the adaptability, iterative learning, and collaborative decision-making of human engineers facing complex problems. On the SWE-bench benchmark, SWE-Search achieves a 23% relative performance improvement across five models compared to standard open-source agents without MCTS.
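To make the feedback loop in item 2 concrete, here is a minimal sketch of how a hybrid value (a numeric score plus a natural-language assessment) might be attached to a search node and propagated up the tree. The `Node` class, its fields, and `backpropagate` are hypothetical names chosen for illustration; the summary does not describe the authors' actual data structures, and in the real system the score and feedback would come from an LLM-based Value Agent rather than being passed in by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One step in a trajectory (hypothetical structure, not the authors' code)."""
    action: str                          # e.g. "plan", "search", "edit"
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0
    feedback: str = ""                   # qualitative assessment of this step

    @property
    def mean_value(self) -> float:
        # Average of all values backed up through this node.
        return self.total_value / self.visits if self.visits else 0.0


def backpropagate(node: Node, value: float, feedback: str) -> None:
    """Store the qualitative feedback on the evaluated node and add the
    numeric value to every node on the path back to the root."""
    node.feedback = feedback
    current: Optional[Node] = node
    while current is not None:
        current.visits += 1
        current.total_value += value
        current = current.parent
```

In a sketch like this, the stored `feedback` string could be included in the prompt the next time the agent expands from that node, which is one way the qualitative half of the value function might influence later actions.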
### Formula Summary
The key formulas involved in the paper include the modified UCT (Upper Confidence Bound for Trees) formula for selecting expansion nodes:
\[
\text{UCT}(s, a)=V(s, a)+C_s\sqrt{\frac{\ln N(s)}{N(s, a)}}+\alpha e^{-\beta(d - 1)}-\gamma\sqrt{d}
\]
where:
- \( V(s, a) \) is the value estimate of the state-action pair \((s, a)\),
- \( N(s, a) \) is the number of visits to the state-action pair \((s, a)\),
- \( N(s) \) is the number of visits to state \( s \),
- \( d \) is the depth of the node in the search tree,
- \( C_s, \alpha, \beta, \gamma \) are constants: \( C_s \) scales the exploration term, \( \alpha \) and \( \beta \) set the size and decay rate of the early-depth bonus, and \( \gamma \) scales the depth penalty.
This formula balances exploration and exploitation as in standard UCT, while adding a bonus that is largest at depth \( d = 1 \) and decays exponentially, and a penalty that grows with \( \sqrt{d} \). Shallow nodes are therefore favored for exploration early on, and very deep trajectories are discouraged.
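The formula above can be sketched directly in Python. This is an illustrative implementation only: the function names, the default constants, and the convention of returning infinity for unvisited children are assumptions, not values or choices reported in the summary.

```python
import math


def modified_uct(value, parent_visits, child_visits, depth,
                 c=1.41, alpha=1.0, beta=0.5, gamma=0.1):
    """Score a child node with the depth-aware UCT formula.

    value         -- V(s, a), the value estimate for the child
    parent_visits -- N(s), visits to the parent state (must be >= 1)
    child_visits  -- N(s, a), visits to the child
    depth         -- d, depth of the child in the tree (d >= 1)
    c, alpha, beta, gamma -- constants (defaults are placeholders)
    """
    if child_visits == 0:
        # Assumed convention: always try unvisited children first.
        return math.inf
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    early_bonus = alpha * math.exp(-beta * (depth - 1))
    depth_penalty = gamma * math.sqrt(depth)
    return value + exploration + early_bonus - depth_penalty


def select_child(children, parent_visits):
    """Pick the child dict (keys: value, visits, depth) with the highest score."""
    return max(
        children,
        key=lambda ch: modified_uct(ch["value"], parent_visits,
                                    ch["visits"], ch["depth"]),
    )
```

With equal value estimates and visit counts, a node at depth 1 scores higher than one at depth 4, because the early bonus shrinks and the penalty grows with depth.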