ACPBench: Reasoning about Action, Change, and Planning

Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi
2024-10-23
Abstract: There is an increasing body of work using Large Language Models (LLMs) as agents for orchestrating workflows and making decisions in domains that require planning and multi-step reasoning. As a result, it is imperative to evaluate LLMs on core skills required for planning. In this work, we present ACPBench, a benchmark for evaluating the reasoning tasks in the field of planning. The benchmark consists of 7 reasoning tasks over 13 planning domains. The collection is constructed from planning domains described in a formal language. This allows us to synthesize problems with provably correct solutions across many tasks and domains. Further, it allows us the luxury of scale without additional human effort, i.e., many additional problems can be created automatically. Our extensive evaluation of 22 LLMs and OpenAI o1 reasoning models highlights the significant gap in the reasoning capability of the LLMs. Our findings with OpenAI o1, a multi-turn reasoning model, reveal significant gains in performance on multiple-choice questions, yet surprisingly, no notable progress is made on boolean questions. The ACPBench collection is available at https://ibm.github.io/ACPBench.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates the reasoning capabilities of large language models (LLMs) in the domain of planning and explores how to improve them. Specifically, the authors propose a benchmark named ACPBench to assess the reasoning abilities of LLMs on planning tasks. The main objectives of the paper are as follows:

1. **Identify Key Reasoning Tasks**: The authors identify 7 atomic reasoning tasks that are crucial for effective planning:
   - **Applicability**: Determining whether a given action can be executed in a given state.
   - **Progression**: Determining the state that results from executing a given action in a given state.
   - **Reachability**: Assessing whether a specific sub-goal can be reached from a given state through a series of actions.
   - **Action Reachability**: Evaluating whether a given action can eventually become executable starting from a given state.
   - **Validation**: Checking whether a specified sequence of actions is applicable from the initial state and achieves the goal.
   - **Justification**: Determining whether each action in the plan is necessary.
   - **Landmarks**: Identifying sub-goals that must be achieved on the way to the main goal.
2. **Construct the Benchmark**: ACPBench covers 13 planning domains, each with all 7 reasoning tasks. The datasets for these tasks are generated from formal descriptions in the Planning Domain Definition Language (PDDL), which makes the answers provably correct and lets additional problems be created automatically.
3. **Evaluate Existing Models**: The authors conduct extensive evaluations of 22 state-of-the-art LLMs and OpenAI's o1 reasoning models, revealing significant gaps in their reasoning capabilities. Notably, o1 shows significant gains on multiple-choice questions but no notable progress on Boolean questions.
4. **Fine-tune Models**: To further enhance model performance, the authors fine-tune a model with 8B parameters and demonstrate performance improvements across multiple tasks, even generalizing to unseen domains.
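Several of the reasoning tasks listed under the first objective can be made concrete with a small STRIPS-style sketch. This is an illustration only, not the benchmark's generator or the authors' code: the `Action` class, the helper function names, and the two-room toy domain are all hypothetical, and the reachability check uses plain breadth-first search, which is only practical for tiny state spaces.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action: preconditions, add and delete effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def is_applicable(state, action):
    # Applicability: every precondition must hold in the state.
    return action.preconditions <= state


def progress(state, action):
    # Progression: remove the delete effects, then add the add effects.
    if not is_applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete_effects) | action.add_effects


def validate(state, plan, goal):
    # Validation: each step must be applicable and the final state must satisfy the goal.
    for action in plan:
        if not is_applicable(state, action):
            return False
        state = progress(state, action)
    return goal <= state


def is_reachable(state, actions, subgoal):
    # Reachability: breadth-first search over the states reachable from `state`.
    frontier, seen = deque([state]), {state}
    while frontier:
        current = frontier.popleft()
        if subgoal <= current:
            return True
        for action in actions:
            if is_applicable(current, action):
                successor = progress(current, action)
                if successor not in seen:
                    seen.add(successor)
                    frontier.append(successor)
    return False


# Toy domain (an assumption for illustration): a robot carries a ball from room A to room B.
pick_a = Action("pick-in-A",
                frozenset({"robot-at-A", "ball-at-A", "hand-empty"}),
                frozenset({"holding-ball"}),
                frozenset({"ball-at-A", "hand-empty"}))
move_ab = Action("move-A-B",
                 frozenset({"robot-at-A"}),
                 frozenset({"robot-at-B"}),
                 frozenset({"robot-at-A"}))
drop_b = Action("drop-in-B",
                frozenset({"robot-at-B", "holding-ball"}),
                frozenset({"ball-at-B", "hand-empty"}),
                frozenset({"holding-ball"}))

actions = [pick_a, move_ab, drop_b]
initial = frozenset({"robot-at-A", "ball-at-A", "hand-empty"})
goal = frozenset({"ball-at-B"})

print(is_applicable(initial, drop_b))                     # False: the robot holds nothing yet
print(validate(initial, [pick_a, move_ab, drop_b], goal)) # True
print(validate(initial, [move_ab, drop_b], goal))         # False: drop-in-B is not applicable
print(is_reachable(initial, actions, goal))               # True
```

In this toy domain there is no action that moves the robot back to room A, so a sub-goal such as `{"robot-at-A", "ball-at-B"}` is unreachable. A benchmark question on reachability asks a model to reach the same kind of verdict from a natural-language description of the state and actions.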
### Summary

Through ACPBench, the authors not only provide a tool for systematically evaluating the planning capabilities of LLMs but also reveal the current limitations of LLMs in reasoning tasks. They also demonstrate the potential for improvement through fine-tuning. This work is significant for advancing the application of LLMs in domains requiring multi-step reasoning and planning.