Abstract: There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to evaluate and improve the reasoning capabilities of large language models (LLMs) in the domain of planning. Specifically, the authors propose a benchmark named ACPBench to assess the reasoning abilities of LLMs in planning tasks. The main objectives of the paper are as follows:

1. **Identify Key Reasoning Tasks**: The authors identify 7 atomic reasoning tasks that are crucial for effective planning:
   - **Applicability**: Determining whether a given action can be executed in a given state.
   - **Progression**: Determining the state that results from executing a given action in a given state.
   - **Reachability**: Assessing whether a specific sub-goal can be reached from a given state through a series of actions.
   - **Action Reachability**: Evaluating whether a given action can eventually become executable from a given state.
   - **Validation**: Assessing whether a specified sequence of actions is valid, applicable, and successfully achieves the goal.
   - **Justification**: Determining whether each action in the plan is necessary.
   - **Landmarks**: Identifying sub-goals that are essential for achieving the main goal.
2. **Construct the Benchmark**: ACPBench covers 13 planning domains, each with all 7 reasoning tasks. The datasets for these tasks are generated from formal Planning Domain Definition Language (PDDL) descriptions, which makes the answers provably correct and lets additional problems be generated automatically.
3. **Evaluate Existing Models**: The authors conduct extensive evaluations of 22 LLMs and OpenAI's o1 reasoning models, revealing significant gaps in their reasoning capabilities. Notably, o1 shows significant gains on multiple-choice questions but no notable progress on boolean questions.
4. **Fine-tune Models**: To further enhance model performance, the authors fine-tune a model with 8B parameters and demonstrate performance improvements across multiple tasks, even generalizing to unseen domains.
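Several of the tasks listed above have a compact operational reading when a planning domain is modeled in a simplified STRIPS style, where a state is a set of true facts and an action has preconditions, add effects, and delete effects. The sketch below illustrates applicability, progression, reachability, action reachability, and validation under that assumption. It is not the benchmark's own generator: the two-room robot domain, the fact names, and every function name here are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import FrozenSet, NamedTuple, Sequence

State = FrozenSet[str]  # a state is the set of facts that currently hold


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # facts that must hold before execution
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def applicable(state: State, action: Action) -> bool:
    """Applicability: all preconditions hold in the state."""
    return action.pre <= state


def progress(state: State, action: Action) -> State:
    """Progression: the successor state after executing the action."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def reachable(state: State, actions: Sequence[Action], subgoal: FrozenSet[str]) -> bool:
    """Reachability: breadth-first search for any state containing the sub-goal."""
    seen, queue = {state}, deque([state])
    while queue:
        current = queue.popleft()
        if subgoal <= current:
            return True
        for action in actions:
            if applicable(current, action):
                nxt = progress(current, action)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


def action_reachable(state: State, actions: Sequence[Action], action: Action) -> bool:
    """Action reachability: some reachable state satisfies the preconditions."""
    return reachable(state, actions, action.pre)


def validate(state: State, plan: Sequence[Action], goal: FrozenSet[str]) -> bool:
    """Validation: every step is applicable and the final state meets the goal."""
    for action in plan:
        if not applicable(state, action):
            return False
        state = progress(state, action)
    return goal <= state


# Hypothetical toy domain: a robot carries a ball from room A to room B.
fs = frozenset
move_ab = Action("move_a_b", fs({"robot_at_a"}), fs({"robot_at_b"}), fs({"robot_at_a"}))
move_ba = Action("move_b_a", fs({"robot_at_b"}), fs({"robot_at_a"}), fs({"robot_at_b"}))
pick = Action("pick_ball_a", fs({"robot_at_a", "ball_at_a", "hand_empty"}),
              fs({"holding_ball"}), fs({"ball_at_a", "hand_empty"}))
drop = Action("drop_ball_b", fs({"robot_at_b", "holding_ball"}),
              fs({"ball_at_b", "hand_empty"}), fs({"holding_ball"}))

actions = [move_ab, move_ba, pick, drop]
init = fs({"robot_at_a", "ball_at_a", "hand_empty"})
goal = fs({"ball_at_b"})

if __name__ == "__main__":
    print("pick applicable in init:", applicable(init, pick))
    print("plan valid:", validate(init, [pick, move_ab, drop], goal))
    print("goal reachable:", reachable(init, actions, goal))
```

Because each answer is computed from the formal model rather than written by hand, a question such as "is `drop_ball_b` applicable in the initial state?" comes with a checkable ground truth, which is the property the benchmark relies on. The exhaustive search used here only suits tiny state spaces; it stands in for whatever planning tools a real generator would use.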
### Summary

Through ACPBench, the authors not only provide a tool for systematically evaluating the planning capabilities of LLMs but also reveal the current limitations of LLMs in reasoning tasks. They also demonstrate the potential for improvement through fine-tuning. This work is significant for advancing the application of LLMs in domains requiring multi-step reasoning and planning.

ACPBench: Reasoning about Action, Change, and Planning
