When is Tree Search Useful for LLM Planning? It Depends on the Discriminator

Ziru Chen, Michael White, Raymond Mooney, Ali Payani, Yu Su, Huan Sun
2024-06-06
Abstract: In this paper, we examine how large language models (LLMs) solve multi-step problems under a language agent framework with three components: a generator, a discriminator, and a planning method. We investigate the practical utility of two advanced planning methods, iterative correction and tree search. We present a comprehensive analysis of how discrimination accuracy affects the overall performance of agents when using these two methods or a simpler method, re-ranking. Experiments on two tasks, text-to-SQL parsing and mathematical reasoning, show that: (1) advanced planning methods demand discriminators with at least 90% accuracy to achieve significant improvements over re-ranking; (2) current LLMs' discrimination abilities have not met the needs of advanced planning methods to achieve such improvements; (3) with LLM-based discriminators, advanced planning methods may not adequately balance accuracy and efficiency. For example, compared to the other two methods, tree search is at least 10-20 times slower but leads to negligible performance gains, which hinders its real-world applications. Code and data are available at https://github.com/OSU-NLP-Group/llm-planning-eval.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper examines how effective planning methods are when large language models (LLMs) solve multi-step tasks within a framework composed of a generator, a discriminator, and a planning method. Specifically, the authors investigate the practical utility of two advanced planning methods (iterative correction and tree search) compared with a simpler method (re-ranking), focusing on how the accuracy of the discriminator affects the overall performance of each method.

### Main Research Questions

1. **How does the accuracy of the discriminator affect the overall performance of language agents using different planning methods?**
   - The study finds that advanced planning methods (iterative correction and tree search) require the discriminator to reach at least 90% accuracy before they significantly outperform simple re-ranking.
   - Current LLMs have not yet reached this level of discrimination accuracy, so advanced planning methods show no significant advantage in practice.
2. **Can LLM-based discriminators correctly evaluate the actions of language agents in practical settings?**
   - The experiments show that environmental feedback can significantly improve LLMs' discrimination accuracy. For example, discriminator accuracy increased by 30.2 percentage points on the text-to-SQL parsing task and by 8.4 percentage points on the mathematical reasoning task.
   - Even with these improvements, LLM-based discriminators still struggle to meet the demands of advanced planning methods on some tasks, particularly when accuracy must be balanced against efficiency.

### Experimental Results

- **Advanced Planning Methods Require High-Accuracy Discriminators**:
  - Iterative correction and tree search show significant performance improvements only when the discriminator's accuracy is at least 90%.
  - Current LLM-based discriminators do not meet this standard, so these advanced methods do not significantly outperform simple re-ranking.
- **Improvements in LLM-Based Discriminators**:
  - Environmental feedback (such as program executability checks and execution results) can significantly improve the accuracy of the discriminator.
  - The improved discriminators perform better on both evaluated tasks, text-to-SQL parsing and mathematical reasoning.
- **Trade-off Between Accuracy and Efficiency**:
  - Tree search can improve performance in some cases, but it runs much slower than the other methods (at least 10-20 times slower, according to the abstract), which may hinder its deployment in practical applications.
  - The inference time of iterative correction increases with the accuracy of the discriminator, indicating that developing planning methods that are both efficient and accurate remains a key issue.

### Conclusion

Through systematic analysis and experiments, this paper shows that the practical utility of advanced planning methods in multi-step tasks is limited by the accuracy of the discriminator. Although environmental feedback can significantly improve discriminator accuracy, these advanced methods still face challenges in balancing accuracy and efficiency in practical applications. Future research should further explore how to improve LLM-based discriminators so that advanced planning methods can benefit from them.
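
The generator, discriminator, and planning-method framework summarized above can be illustrated with a small Python sketch. This is not the authors' implementation (their code is at the repository linked in the abstract); every name below (`simulated_discriminator`, `rerank`, `iterative_correction`) is hypothetical, and the toy task of finding the integer 42 stands in for a real text-to-SQL or math problem. The sketch assumes a discriminator whose accuracy is controlled by flipping an oracle label with a fixed probability, which is one simple way to study how accuracy affects each planning method. Tree search is omitted for brevity.

```python
import random
from typing import Callable, Sequence, TypeVar

Plan = TypeVar("Plan")


def simulated_discriminator(
    is_correct: Callable[[Plan], bool],
    accuracy: float,
    rng: random.Random,
) -> Callable[[Plan], float]:
    """Return a scoring function that agrees with the oracle `is_correct`
    with probability `accuracy` (an assumed simulation, not a real LLM)."""

    def score(plan: Plan) -> float:
        truth = is_correct(plan)
        # With probability `accuracy` report the true label, otherwise flip it.
        judged_correct = truth if rng.random() < accuracy else not truth
        return 1.0 if judged_correct else 0.0

    return score


def rerank(candidates: Sequence[Plan], score: Callable[[Plan], float]) -> Plan:
    """Re-ranking: sample complete plans up front, keep the highest-scoring one.
    Ties resolve to the earliest candidate."""
    return max(candidates, key=score)


def iterative_correction(
    generate: Callable[[], Plan],
    revise: Callable[[Plan], Plan],
    score: Callable[[Plan], float],
    max_rounds: int = 3,
    threshold: float = 0.5,
) -> Plan:
    """Iterative correction: accept the current plan once the discriminator
    scores it above `threshold`; otherwise ask the generator to revise it."""
    plan = generate()
    for _ in range(max_rounds):
        if score(plan) > threshold:
            return plan
        plan = revise(plan)
    return plan


if __name__ == "__main__":
    rng = random.Random(0)
    perfect = simulated_discriminator(lambda plan: plan == 42, 1.0, rng)
    print(rerank([1, 42, 7], perfect))  # 42
    print(iterative_correction(lambda: 40, lambda p: p + 1, perfect))  # 42
```

Lowering `accuracy` makes both methods accept wrong plans more often. Iterative correction is additionally exposed to false negatives, because a wrongly rejected correct plan gets revised away, which gives some intuition for why the advanced methods depend so heavily on discriminator quality.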