Abstract: Large Language Models have shown remarkable reasoning capabilities with advanced prompting techniques, but they fall short on tasks that require exploration, strategic foresight, and sequential decision-making. Recent works propose using external programs to define the search logic, so that LLMs can perform passive tree search to solve more challenging reasoning tasks. Though impressive results have been achieved, these approaches have several fundamental limitations. First, passive tree searches are inefficient, as they usually require multiple rounds of LLM API calls to solve a single problem. Moreover, passive search methods are inflexible, since they need task-specific program designs. A natural question then arises: can we maintain the tree-search capability of LLMs without the aid of external programs, while still generating responses that clearly demonstrate the process of a tree-structured search? To this end, we propose a new concept called the autonomous tree-search ability of LLMs, whereby the model automatically generates a response containing the search trajectories that lead to the correct answer. Concretely, we generate search trajectories using a capable LLM API via a fixed system prompt, allowing it to perform autonomous tree-search (ATS) right out of the box. Experiments on four puzzle games show that our method achieves substantial improvements. The ATS-BFS method outperforms the Chain of Thought (CoT) approach with an average accuracy improvement of 33%. Compared to Tree of Thoughts, it requires 65.6% or 47.7% lower GPT API cost to attain a comparable level of accuracy. Moreover, we have collected data using the ATS prompt method and fine-tuned LLaMA on it. This approach yields a greater improvement than fine-tuning on CoT data. Specifically, it outperforms CoT-tuned LLaMAs by an average of 40.6% and 38.5% for LLaMA2-7B and LLaMA2-13B, respectively.
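To make the contrast between passive and autonomous search concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy puzzle (reach a target integer from a start value using "+3" or "*2") and a hypothetical `StubLLM` class that stands in for an LLM API and only counts calls. The names `StubLLM`, `propose`, `autonomous_search`, and `passive_bfs`, as well as the system prompt text, are illustrative assumptions.

```python
from collections import deque


class StubLLM:
    """Hypothetical stand-in for an LLM API; counts calls, no network access."""

    def __init__(self):
        self.calls = 0

    def propose(self, state):
        # One simulated API call: propose successors of a single node.
        self.calls += 1
        return [state + 3, state * 2]

    def autonomous_search(self, system_prompt, start, target, max_depth):
        # One simulated API call: the whole breadth-first trajectory is
        # returned as text in a single response (system_prompt is unused
        # by the stub and only marks where a fixed prompt would go).
        self.calls += 1
        lines, frontier = [], [start]
        for depth in range(max_depth + 1):
            lines.append(f"depth {depth}: {frontier}")
            if target in frontier:
                lines.append(f"answer: reached {target}")
                break
            frontier = [s for x in frontier for s in (x + 3, x * 2) if s <= target]
        return "\n".join(lines)


def passive_bfs(llm, start, target, max_depth):
    """Passive search: an external program owns the loop and calls the LLM per node."""
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if state == target:
            return True
        if depth < max_depth:
            for nxt in llm.propose(state):
                if nxt <= target:  # prune states that overshoot the target
                    frontier.append((nxt, depth + 1))
    return False


passive_llm = StubLLM()
found = passive_bfs(passive_llm, start=1, target=11, max_depth=4)

ats_llm = StubLLM()
trace = ats_llm.autonomous_search(
    "Search breadth-first and show every level.", start=1, target=11, max_depth=4
)

print("passive calls:", passive_llm.calls)
print("autonomous calls:", ats_llm.calls)
print(trace)
```

In this toy setting the passive loop issues one simulated call per expanded node, while the autonomous variant issues a single call whose response contains the whole trajectory. This mirrors the cost argument in the abstract but says nothing about the accuracy of a real model.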

Effective Large Language Model Debugging with Best-first Tree Search

LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

DebugBench: Evaluating Debugging Capability of Large Language Models

Teaching Large Language Models to Self-Debug

Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

RoT: Enhancing Large Language Models with Reflection on Search Trees

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Debugging with Open-Source Large Language Models: An Evaluation

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Training LLMs to Better Self-Debug and Explain Code

Leveraging Print Debugging to Improve Code Generation in Large Language Models

RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance

Autonomous Tree-search Ability of Large Language Models

Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

A Study on Training and Developing Large Language Models for Behavior Tree Generation

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Large Language Model Guided Tree-of-Thought

A Deep Dive into Large Language Model Code Generation Mistakes: What and Why?

A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair