Exploring and Benchmarking the Planning Capabilities of Large Language Models

Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, Hanie Sedghi
2024-06-19
Abstract: We seek to elevate the planning capabilities of Large Language Models (LLMs) by investigating four main directions. First, we construct a comprehensive benchmark suite encompassing both classical planning domains and natural language scenarios. This suite includes algorithms to generate instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance. Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures. Finally, we investigate the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper studies how to improve the planning capability of large language models (LLMs). The authors pursue four main directions:
1. They construct a comprehensive benchmark suite covering both classical planning domains and natural language scenarios, with algorithms that generate instances of varying difficulty for rigorous and systematic evaluation.
2. They investigate in-context learning (ICL), in particular how increasing the context length affects planning performance.
3. They demonstrate the benefit of fine-tuning LLMs on optimal planning paths and the effectiveness of incorporating model-driven search procedures.
4. They study how the proposed methods generalize to out-of-distribution scenarios, that is, to novel and unseen planning challenges.
The paper notes that although LLMs have shown planning ability on certain tasks, they can still produce invalid or erroneous plans even in simple scenarios. The authors improve planning performance using ICL with many examples in the context (many-shot learning), fine-tuning strategies, and search algorithms such as Monte Carlo Tree Search (MCTS). The experiments show that a combination of guided models, many-shot learning in long contexts, fine-tuning, and search strategies can significantly enhance planning capability, with some degree of generalization to unseen problems. The study also compares the formal Planning Domain Definition Language (PDDL) with natural language representations of planning tasks to evaluate model performance in both settings. The paper closes with the experimental setup, the results, and proposed future directions, including further research on the generalization of planning capability.
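To make the idea of generating instances of varying difficulty concrete, the following is a minimal sketch in Python, not the paper's actual generator. It assumes the classic Blocksworld domain with the commonly used predicates `on`, `ontable`, `clear`, and `handempty`, and it uses the number of blocks as a simple difficulty knob. The function names, the domain name `blocksworld`, and the problem naming scheme are all hypothetical choices made for illustration.

```python
import random


def random_stacks(blocks, rng):
    """Split the blocks into random stacks; each stack is listed bottom to top."""
    blocks = list(blocks)
    rng.shuffle(blocks)
    stacks = []
    for block in blocks:
        # Either start a new stack on the table or place the block on an existing one.
        if not stacks or rng.random() < 0.5:
            stacks.append([block])
        else:
            rng.choice(stacks).append(block)
    return stacks


def stacks_to_facts(stacks, include_clear):
    """Turn a stack configuration into PDDL facts (assumed predicate names)."""
    facts = []
    for stack in stacks:
        facts.append(f"(ontable {stack[0]})")
        for below, above in zip(stack, stack[1:]):
            facts.append(f"(on {above} {below})")
        if include_clear:
            facts.append(f"(clear {stack[-1]})")
    return facts


def make_blocksworld_problem(num_blocks, seed=0):
    """Return a PDDL problem string; more blocks means a harder instance."""
    rng = random.Random(seed)  # seeded so that instances are reproducible
    blocks = [f"b{i}" for i in range(1, num_blocks + 1)]
    init = stacks_to_facts(random_stacks(blocks, rng), include_clear=True)
    init.append("(handempty)")
    goal = stacks_to_facts(random_stacks(blocks, rng), include_clear=False)
    return (
        f"(define (problem bw-{num_blocks}-{seed})\n"
        f"  (:domain blocksworld)\n"
        f"  (:objects {' '.join(blocks)})\n"
        f"  (:init {' '.join(init)})\n"
        f"  (:goal (and {' '.join(goal)})))"
    )


if __name__ == "__main__":
    # Print one small and one larger instance to compare difficulty levels.
    for n in (3, 6):
        print(make_blocksworld_problem(n, seed=42))
        print()
```

Sweeping `num_blocks` and `seed` yields a family of reproducible instances of increasing size. A natural language version of the same instance could be produced by rendering each fact as a sentence, which is one way the PDDL and natural language settings could be compared on identical problems.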