Exploring and Benchmarking the Planning Capabilities of Large Language Models

Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, Hanie Sedghi

2024-06-19

Abstract: We seek to elevate the planning capabilities of Large Language Models (LLMs) by investigating four main directions. First, we construct a comprehensive benchmark suite encompassing both classical planning domains and natural language scenarios. This suite includes algorithms to generate instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance. Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures. Finally, we investigate the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper examines how to improve the planning capabilities of large language models (LLMs). It pursues four main directions:

1. Constructing a comprehensive benchmark suite that covers both classical planning domains and natural language scenarios, with algorithms that generate instances of varying difficulty for rigorous and systematic evaluation.
2. Investigating in-context learning (ICL), in particular how increasing the context length affects planning performance.
3. Demonstrating the benefit of fine-tuning LLMs on optimal planning paths, and the effectiveness of incorporating model-driven search procedures.
4. Studying how the proposed methods perform in out-of-distribution scenarios, to assess how well they generalize to novel and unseen planning challenges.

The paper notes that although LLMs have demonstrated planning capabilities on certain tasks, they can still produce invalid or erroneous plans even in simple scenarios. Using ICL with many examples, fine-tuning strategies, and search algorithms such as Monte Carlo Tree Search (MCTS), the researchers improve the planning performance of LLMs. The experiments indicate that combining guided models, in-context learning with many examples in long contexts, fine-tuning, and search strategies can significantly enhance planning capability and yields some degree of generalization to unseen problems.

The study also compares two representations of planning tasks, the formal Planning Domain Definition Language (PDDL) and natural language, to evaluate model performance in each setting. The paper closes by discussing the experimental setup and results and by proposing future directions, including further research on planning generalization.
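To make the ideas of difficulty-controlled instance generation and invalid plans concrete, here is a minimal sketch in Python. It is not the paper's benchmark code. It rests on several assumptions that do not come from the source: a toy Blocksworld-style domain, the number of blocks as a stand-in for difficulty, and the hypothetical helper names `random_state`, `generate_instance`, and `validate_plan`.

```python
import random


def random_state(num_blocks, rng):
    """Return a random state as a list of stacks, each listed bottom to top."""
    blocks = [f"b{i}" for i in range(num_blocks)]
    rng.shuffle(blocks)
    stacks = []
    for block in blocks:
        # Either start a new stack on the table or place the block on an existing one.
        if not stacks or rng.random() < 0.5:
            stacks.append([block])
        else:
            rng.choice(stacks).append(block)
    return stacks


def generate_instance(num_blocks, seed=0):
    """Generate an instance; more blocks is used here as a rough proxy for difficulty."""
    rng = random.Random(seed)
    return {"init": random_state(num_blocks, rng), "goal": random_state(num_blocks, rng)}


def canonical(stacks):
    """Order-independent form of a state, ignoring empty stacks."""
    return sorted(tuple(s) for s in stacks if s)


def validate_plan(instance, plan):
    """Check a plan given as (block, destination) moves; destination is a block or 'table'.

    Returns False as soon as a move is illegal, or if the goal is not reached.
    """
    stacks = [list(s) for s in instance["init"]]
    for block, dest in plan:
        # Only a clear block (top of its stack) can be moved.
        src = next((s for s in stacks if s and s[-1] == block), None)
        if src is None or dest == block:
            return False
        if dest == "table":
            src.pop()
            stacks.append([block])
        else:
            # The destination block must also be clear.
            dst = next((s for s in stacks if s and s[-1] == dest), None)
            if dst is None:
                return False
            src.pop()
            dst.append(block)
    return canonical(stacks) == canonical(instance["goal"])


if __name__ == "__main__":
    instance = {"init": [["a", "b"]], "goal": [["b", "a"]]}
    print(validate_plan(instance, [("b", "table"), ("a", "b")]))
    print(validate_plan(instance, [("a", "b")]))
```

A validator of this kind can score a model's output mechanically: a plan counts as correct only if every move is legal and the final state matches the goal. That is one way to detect the invalid or erroneous plans mentioned above.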
