Abstract: Code generation by large language models (LLMs) has been studied extensively, and benchmarks such as HumanEval have been surpassed with an impressive 96.3% success rate. However, these benchmarks predominantly judge a model's performance on basic function-level code generation and do not test the critical thinking and awareness of scope required in real-world scenarios such as solving GitHub issues. This research applies the Tree of Thoughts (ToT) language model reasoning framework to enhance the decision-making and problem-solving abilities of LLMs on this complex task. Compared to traditional input-output (IO) prompting and Retrieval Augmented Generation (RAG) techniques, ToT is designed to improve performance by facilitating a structured exploration of multiple reasoning trajectories and enabling self-assessment of potential solutions. We experimentally deploy ToT to tackle a GitHub issue contained within an instance of SWE-bench. However, our results reveal that the ToT framework alone is not enough to give LLMs the critical reasoning capabilities needed to outperform existing methods. In this paper we analyze the potential causes of these shortcomings and identify key areas for improvement, such as deepening the thought process and introducing agentic capabilities. The insights of this research are intended to inform future directions for refining the application of ToT and better harnessing the potential of LLMs in real-world problem-solving scenarios.

What problem does this paper attempt to address?

This paper explores whether the **Tree of Thoughts (ToT)** framework can be used to solve code problems on GitHub. Specifically, the researchers apply ToT to a more complex real-world task: resolving GitHub issues. This task requires not only code generation but also stronger critical reasoning abilities.

#### Main problem description:

1. **Limitations of existing benchmarks**:
   - Current large language models (LLMs) perform well on benchmarks such as HumanEval, but these benchmarks mainly evaluate code generation at the basic function level and do not test the critical thinking and understanding of scope required in real-world scenarios.
2. **Complexity of GitHub issue resolution**:
   - Solving a GitHub issue requires an overall understanding of the entire codebase, and any modification must take the dependencies within that codebase into account. This demands stronger decision-making and problem-solving abilities from an LLM.
3. **Application of the ToT framework**:
   - The researchers introduce the ToT framework to enhance the decision-making and problem-solving abilities of LLMs. ToT aims to improve LLM performance on complex tasks by exploring multiple reasoning paths in a structured way and by self-evaluating candidate solutions.
4. **Experimental results and analysis**:
   - The experimental results show that the ToT framework alone is not sufficient for LLMs to outperform existing methods at solving GitHub issues. The researchers analyze the deficiencies of the framework and propose directions for improvement, such as deepening the thought process and introducing agentic capabilities.

#### Conclusion:

Although the ToT framework performs well on some simple tasks, it still needs further improvement for complex tasks such as GitHub issues. Future research should focus on optimizing the ToT framework so that it better meets the requirements of real-world programming tasks.

### Key formulas (none)

This article does not involve specific mathematical formulas; it mainly concerns research methods and experimental design in natural language processing and software engineering.
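To make the ToT idea summarized above more concrete (structured exploration of several reasoning paths plus self-evaluation), here is a minimal sketch of a breadth-first ToT search. It is not the paper's implementation, which this page does not describe. The function `tree_of_thoughts_bfs`, its parameters `depth` and `beam_width`, and the toy `toy_propose` and `toy_score` helpers are all hypothetical; in a real system both helpers would be LLM calls rather than fixed rules.

```python
from typing import Callable, List, Tuple

# A partial reasoning path: one string per reasoning step.
Thought = Tuple[str, ...]


def tree_of_thoughts_bfs(
    root: Thought,
    propose: Callable[[Thought], List[str]],
    score: Callable[[Thought], float],
    depth: int,
    beam_width: int,
) -> Thought:
    """Breadth-first ToT sketch: expand paths, self-evaluate, keep the best few."""
    frontier: List[Thought] = [root]
    for _ in range(depth):
        # Expand every kept path with each proposed next step.
        candidates = [path + (step,) for path in frontier for step in propose(path)]
        if not candidates:
            break
        # Self-assessment: rank the candidates and prune to the beam width.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


# Toy stand-ins for the two LLM calls (hypothetical, for illustration only).
def toy_propose(path: Thought) -> List[str]:
    # A real proposer would ask the model for next steps given the path so far.
    return ["locate", "patch", "test"]


def toy_score(path: Thought) -> float:
    # A real evaluator would ask the model to rate the path; here we simply
    # count how many steps match an assumed "ideal" ordering.
    target = ("locate", "patch", "test")
    return float(sum(a == b for a, b in zip(path, target)))


best = tree_of_thoughts_bfs((), toy_propose, toy_score, depth=3, beam_width=2)
print(best)
```

In contrast, IO prompting corresponds to a single proposal with no pruning step. The `beam_width` parameter controls how many alternative trajectories survive each round of self-evaluation.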

Can Github issues be solved with Tree Of Thoughts?

Large Language Model Guided Tree-of-Thought

Tree of Problems: Improving structured problem solving with compositionality

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

iToT: An Interactive System for Customized Tree-of-Thought Generation

Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

When Do Program-of-Thought Works for Reasoning?

Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination

Empowering Multi-step Reasoning across Languages via Tree-of-Thoughts

Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning

$T^2$ of Thoughts: Temperature Tree Elicits Reasoning in Large Language Models

How Do Humans Write Code? Large Models Do It the Same Way Too

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models

Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models

The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models

Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic

Synergy-of-Thoughts: Eliciting Efficient Reasoning in Hybrid Language Models

Supervised Chain of Thought