Abstract: In this paper, we examine how large language models (LLMs) solve multi-step problems under a language agent framework with three components: a generator, a discriminator, and a planning method. We investigate the practical utility of two advanced planning methods, iterative correction and tree search. We present a comprehensive analysis of how discrimination accuracy affects the overall performance of agents when using these two methods or a simpler method, re-ranking. Experiments on two tasks, text-to-SQL parsing and mathematical reasoning, show that: (1) advanced planning methods demand discriminators with at least 90% accuracy to achieve significant improvements over re-ranking; (2) current LLMs' discrimination abilities have not met the needs of advanced planning methods to achieve such improvements; (3) with LLM-based discriminators, advanced planning methods may not adequately balance accuracy and efficiency. For example, compared to the other two methods, tree search is at least 10 to 20 times slower but leads to negligible performance gains, which hinders its real-world applications. Code and data are available at https://github.com/OSU-NLP-Group/llm-planning-eval.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper examines how effective planning methods are when large language models (LLMs) solve multi-step tasks within a framework composed of a generator, a discriminator, and a planning method. Specifically, the authors investigate the practical utility of two advanced planning methods (iterative correction and tree search) compared to a simpler method (re-ranking), and focus on how the accuracy of the discriminator affects the overall performance of each planning method.

### Main Research Questions

1. **How does the accuracy of the discriminator affect the overall performance of language agents using different planning methods?**
   - The study finds that advanced planning methods (iterative correction and tree search) require a discriminator with at least 90% accuracy to significantly outperform the simple re-ranking method.
   - Current LLMs have not yet reached this level of discrimination ability, so advanced planning methods do not show significant advantages in practical applications.
2. **Can LLM-based discriminators correctly evaluate the actions of language agents in practical settings?**
   - Through experiments, the authors find that environmental feedback can significantly improve the accuracy of LLM-based discrimination. For example, discriminator accuracy increased by 30.2 percentage points on the text-to-SQL parsing task and by 8.4 percentage points on the mathematical reasoning task.
   - Even with these improvements, LLM-based discriminators still struggle to meet the demands of advanced planning methods on some tasks, particularly in balancing accuracy and efficiency.

### Experimental Results

- **Advanced Planning Methods Require High-Accuracy Discriminators**:
  - Iterative correction and tree search show significant performance improvements over re-ranking only when the discriminator's accuracy is at least 90%.
  - Current LLM-based discriminators have not met this standard, so these advanced methods do not significantly outperform the simple re-ranking method in practice.
- **Improvements in LLM-Based Discriminators**:
  - Environmental feedback (such as program executability checks and execution results) can significantly improve the accuracy of the discriminator.
  - Discriminators enhanced in this way perform better on the two tasks studied, text-to-SQL parsing and mathematical reasoning.
- **Trade-off Between Accuracy and Efficiency**:
  - Tree search is at least 10 to 20 times slower than the other two methods while yielding negligible performance gains, which may hinder its deployment in practical applications.
  - The inference time of iterative correction increases with the accuracy of the discriminator, indicating that developing planning methods that are both efficient and accurate remains a key issue.

### Conclusion

Through systematic analysis and experiments, this paper shows that the practical utility of advanced planning methods in multi-step tasks is limited by the accuracy of the discriminator. Although environmental feedback can significantly improve discriminator accuracy, these advanced methods still face challenges in balancing accuracy and efficiency in practical applications. Future research should explore how to improve LLM-based discriminators so that advanced planning methods can benefit from them.
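The dependence of re-ranking on discriminator accuracy can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' experimental code: each candidate is reduced to a boolean (correct or not), the discriminator is a hypothetical oracle whose judgment is flipped with probability `1 - accuracy`, and the names `noisy_discriminator`, `rerank`, and `simulate`, as well as the default values for candidate count and generator quality, are invented for illustration.

```python
import random


def noisy_discriminator(is_correct: bool, accuracy: float, rng: random.Random) -> float:
    """Hypothetical discriminator: returns 1.0 if it judges the candidate
    correct, else 0.0. The judgment matches the ground truth with
    probability `accuracy` and is flipped otherwise."""
    judged_correct = is_correct if rng.random() < accuracy else not is_correct
    return 1.0 if judged_correct else 0.0


def rerank(candidates: list[bool], accuracy: float, rng: random.Random) -> bool:
    """Re-ranking: score every candidate once and return the top-scoring one.
    Ties are broken in favor of the earliest generated candidate."""
    scores = [noisy_discriminator(c, accuracy, rng) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


def simulate(accuracy: float, n_problems: int = 2000, n_candidates: int = 5,
             p_correct: float = 0.3, seed: int = 0) -> float:
    """Fraction of problems solved when the generator (assumed here to be
    right 30% of the time per sample) proposes `n_candidates` candidates
    and re-ranking selects one of them."""
    rng = random.Random(seed)
    solved = 0
    for _ in range(n_problems):
        # Stand-in for the generator: each candidate is correct independently.
        candidates = [rng.random() < p_correct for _ in range(n_candidates)]
        if rerank(candidates, accuracy, rng):
            solved += 1
    return solved / n_problems


if __name__ == "__main__":
    for acc in (0.5, 0.7, 0.9, 1.0):
        print(f"discriminator accuracy {acc:.1f}: solved {simulate(acc):.3f}")
```

Under these toy assumptions, a perfect discriminator solves a problem whenever at least one of the five candidates is correct (probability 1 - 0.7^5, about 83%), while a discriminator at 50% accuracy is no better than taking the first candidate. Iterative correction and tree search invoke the discriminator repeatedly at intermediate steps, which is one plausible reason they are more sensitive to its errors than a single re-ranking pass.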

When is Tree Search Useful for LLM Planning? It Depends on the Discriminator
