Abstract: The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs, making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1's performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) and the new large reasoning models (LRMs) perform on planning tasks. The researchers use PlanBench, a benchmark they developed previously, to measure that performance. The core issues the paper addresses are:

1. **Evaluating the planning capabilities of LLMs**: Since PlanBench was introduced in 2022, soon after the release of GPT-3, many new private and open-source LLMs have emerged, yet their progress on the benchmark has been very slow. The researchers therefore want to understand whether these models truly possess planning capabilities or simply solve problems through approximate retrieval.
2. **Evaluating OpenAI's new model o1 (Strawberry)**: OpenAI claims that o1 has been specifically constructed and trained to escape the limitations of traditional autoregressive LLMs, making it a new type of model, a large reasoning model (LRM). The researchers evaluate o1's performance on planning tasks through PlanBench and explore its advantages and limitations compared with traditional LLMs.
3. **Proposing methods to improve benchmarking**: Because o1 performs well in some respects but still raises questions about accuracy and efficiency, the researchers also discuss how to extend and improve PlanBench so that it remains an effective tool for evaluating future LLMs and LRMs.

### Main contributions of the paper

- **Performance evaluation**: Through a detailed evaluation of multiple LLMs and LRMs on Blocksworld and Mystery Blocksworld tasks, the researchers found that although o1 performs well on some tasks, its performance on complex tasks and unsolvable instances is still not robust.
- **Cost and efficiency analysis**: The researchers point out that LRMs such as o1, although they perform better on some tasks, have significantly higher computational costs than traditional LLMs. In addition, o1's reasoning process is opaque and difficult to explain, which further affects its reliability and trustworthiness.
- **Future development directions**: The paper suggests that future evaluations should place more emphasis on model efficiency, cost, and guarantees, and proposes a hybrid method combining classical planners and LLMs to achieve more efficient and reliable planning.

### Conclusion

Through detailed experiments and analyses, this paper reveals the advantages and limitations of current LLMs and LRMs on planning tasks and offers insights and suggestions for future research.
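To make concrete what it means for a model's plan to be correct on a Blocksworld instance, the following is a minimal sketch of a plan checker. It is an illustration under stated assumptions, not PlanBench's actual tooling: the state encoding (a dictionary mapping each block to what it rests on), the four action names, and the function `validate_plan` are all hypothetical choices made for this example.

```python
from typing import Dict, Optional, Sequence, Tuple

Action = Tuple[str, ...]   # hypothetical encoding, e.g. ("unstack", "A", "B")
TABLE = "table"            # hypothetical marker for "resting on the table"
ARITY = {"pickup": 1, "putdown": 1, "unstack": 2, "stack": 2}


def is_clear(on: Dict[str, str], block: str) -> bool:
    """A block is clear when no other block rests on it."""
    return all(below != block for below in on.values())


def validate_plan(initial_on: Dict[str, str],
                  goal_on: Dict[str, str],
                  plan: Sequence[Action]) -> bool:
    """Simulate the plan step by step; return True only if every action's
    preconditions hold and the final state satisfies all goal relations."""
    on = dict(initial_on)            # block -> what it rests on
    holding: Optional[str] = None    # block currently in the hand, if any
    for action in plan:
        name, args = action[0], action[1:]
        if ARITY.get(name) != len(args):
            return False             # unknown action or wrong argument count
        if name == "pickup":
            (x,) = args
            if holding is not None or on.get(x) != TABLE or not is_clear(on, x):
                return False
            del on[x]
            holding = x
        elif name == "putdown":
            (x,) = args
            if holding != x:
                return False
            on[x] = TABLE
            holding = None
        elif name == "unstack":
            x, y = args
            if holding is not None or y == TABLE or on.get(x) != y or not is_clear(on, x):
                return False
            del on[x]
            holding = x
        else:  # "stack"
            x, y = args
            if holding != x or y not in on or not is_clear(on, y):
                return False
            on[x] = y
            holding = None
    return all(on.get(block) == support for block, support in goal_on.items())


# Example instance: A sits on B; B and C are on the table. Goal: B on A.
initial = {"A": "B", "B": TABLE, "C": TABLE}
goal = {"B": "A"}
good_plan = [("unstack", "A", "B"), ("putdown", "A"),
             ("pickup", "B"), ("stack", "B", "A")]
bad_plan = [("pickup", "B"), ("stack", "B", "A")]  # B is not clear at the start

print(validate_plan(initial, goal, good_plan))
print(validate_plan(initial, goal, bad_plan))
```

A checker of this kind is sound by construction: it accepts a plan only after simulating every step. That is the sort of guarantee the summary contrasts with the opaque reasoning of o1.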

LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
