Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach

2024-07-04

Abstract: Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 82.2% are valid, solvable problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper introduces Planetarium, a benchmark that evaluates how well language models translate natural language descriptions of planning tasks into structured planning languages such as PDDL. Current evaluations usually run the generated PDDL through a planning validator, which only checks that the problem can be solved by a planner; it does not check that the PDDL actually matches the natural language description. In addition, existing evaluation datasets often use natural language descriptions that closely resemble the ground truth PDDL, which makes the task easier than it should be.

To address these issues, Planetarium proposes a rigorous PDDL equivalence algorithm that flexibly compares generated PDDL against a ground truth PDDL to judge correctness. The paper also introduces a dataset of 132,037 text-to-PDDL pairs covering 13 different tasks at varying levels of difficulty. Evaluations of several API-access and open-weight language models reveal how hard the task is: in the zero-shot setting, only 35.1% of the PDDL problem descriptions generated by GPT-4o are semantically correct.

The paper discusses two approaches to solving planning problems with language models: having the model generate plans directly, and having the model convert natural language into PDDL that a traditional planner then solves. Planetarium focuses on the latter, measuring the accuracy of the natural-language-to-PDDL conversion, and argues that stricter benchmarks are needed to drive progress in this area.
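The gap between "solvable" and "semantically correct" can be illustrated with a small sketch. The code below is not the paper's equivalence algorithm; it is a toy stand-in that assumes a problem can be reduced to sets of ground facts and that two problems count as equivalent when some renaming of objects maps one problem's initial state and goal exactly onto the other's. The dictionary layout, the function names `rename` and `equivalent`, and the example problems are all hypothetical.

```python
from itertools import permutations

# Toy stand-in for a PDDL problem (an assumption, not the benchmark's format):
# a set of objects plus initial-state and goal facts, where each fact is a
# tuple of the form (predicate, arg1, arg2, ...).


def rename(facts, mapping):
    """Apply an object renaming to every fact in a collection."""
    return frozenset((f[0],) + tuple(mapping[a] for a in f[1:]) for f in facts)


def equivalent(p, q):
    """Return True if some bijection between the objects of p and q maps
    p's initial state and goal exactly onto q's (brute-force search)."""
    if len(p["objects"]) != len(q["objects"]):
        return False
    p_objs = sorted(p["objects"])
    for perm in permutations(sorted(q["objects"])):
        mapping = dict(zip(p_objs, perm))
        if (rename(p["init"], mapping) == frozenset(q["init"])
                and rename(p["goal"], mapping) == frozenset(q["goal"])):
            return True
    return False


# Ground truth: two blocks on the table, goal is to stack one on the other.
truth = {
    "objects": {"a", "b"},
    "init": {("ontable", "a"), ("ontable", "b"), ("clear", "a"), ("clear", "b")},
    "goal": {("on", "a", "b")},
}

# Same problem with different object names: should count as correct.
renamed = {
    "objects": {"x", "y"},
    "init": {("ontable", "x"), ("ontable", "y"), ("clear", "x"), ("clear", "y")},
    "goal": {("on", "y", "x")},
}

# Well-formed and trivially solvable (the goal already holds in the initial
# state), but it does not describe the stacking task.
wrong_goal = {
    "objects": {"x", "y"},
    "init": {("ontable", "x"), ("ontable", "y"), ("clear", "x"), ("clear", "y")},
    "goal": {("ontable", "x"), ("ontable", "y")},
}

print(equivalent(truth, renamed))     # True
print(equivalent(truth, wrong_goal))  # False
```

A validator that only asks whether a plan exists would accept `wrong_goal`, since its goal already holds in the initial state, while the comparison against the ground truth rejects it. The brute-force search over permutations is only practical for a handful of objects; it is meant to show the idea, not to scale.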
