TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su

2024-10-23

Abstract: Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks; even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.

Computation and Language

What problem does this paper attempt to address?

This paper asks whether current language agents can plan in complex real-world scenarios, using travel planning as the test case. Earlier AI agents mostly operated in constrained settings because many of the cognitive foundations required for human-level planning were lacking. Language agents built on large language models (LLMs) have since shown the ability to use tools and to reason, which may supply some of those missing foundations. To test this, the paper proposes TravelPlanner, a new planning benchmark focused on travel planning, a common real-world planning scenario. TravelPlanner provides a rich sandbox environment containing nearly 4 million data records, along with six tools for accessing them. It also includes 1,225 carefully curated planning intents and reference plans, each imposing a different set of constraints. The evaluation finds that even state-of-the-art models such as GPT-4 complete these planning tasks with a success rate of only 0.6%. Language agents have difficulty staying on task, using the correct tools to collect information, and keeping track of multiple constraints. Nevertheless, the paper notes that the fact that language agents can attempt such a complex problem at all is itself non-trivial progress. Overall, the paper explores the capabilities and limitations of language agents on complex real-world planning tasks and offers TravelPlanner as a challenging testbed for future work.
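To make "keeping track of multiple constraints" and "success rate" concrete, here is a minimal Python sketch. It is not TravelPlanner's actual evaluation code: the `Query` type, the helper functions, the two toy constraints, and the rule that a plan succeeds only when every constraint holds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical types for illustration; not TravelPlanner's actual data model.
Plan = Dict[str, Any]
Constraint = Callable[[Plan], bool]


@dataclass
class Query:
    """A planning intent paired with the constraints it imposes (assumed shape)."""
    text: str
    constraints: List[Constraint]


def plan_succeeds(plan: Plan, query: Query) -> bool:
    # Assumption: a plan counts as a success only if every constraint holds.
    return all(check(plan) for check in query.constraints)


def success_rate(plans: List[Plan], queries: List[Query]) -> float:
    # Fraction of queries whose plan satisfies all of that query's constraints.
    if not queries:
        return 0.0
    passed = sum(plan_succeeds(p, q) for p, q in zip(plans, queries))
    return passed / len(queries)


# Made-up constraints: stay within budget and cover exactly three days.
def within_budget(plan: Plan) -> bool:
    return plan["total_cost"] <= plan["budget"]


def three_days(plan: Plan) -> bool:
    return len(plan["days"]) == 3


query = Query("3-day trip on a fixed budget", [within_budget, three_days])
queries = [query, query]
plans = [
    {"total_cost": 900, "budget": 1000, "days": ["d1", "d2", "d3"]},   # passes both
    {"total_cost": 1200, "budget": 1000, "days": ["d1", "d2", "d3"]},  # over budget
]
rate = success_rate(plans, queries)
```

Under this all-or-nothing assumption, a plan that violates even one constraint counts as a failure, which is one way a low overall success rate can arise even when most individual constraints are met.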

ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning

Smart Language Agents in Real-World Planning

Revealing the Barriers of Language Agents in Planning

Can We Rely on LLM Agents to Draft Long-Horizon Plans? Let's Take TravelPlanner as an Example

TravelAgent: An AI Assistant for Personalized Travel Planning

NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

A Human-Like Reasoning Framework for Multi-Phases Planning Task with Large Language Models

Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools

Exploring and Benchmarking the Planning Capabilities of Large Language Models

What's the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models

Ask-before-Plan: Proactive Language Agents for Real-World Planning

Planning with Multi-Constraints via Collaborative Language Agents

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities

On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark)

LASP: Surveying the State-of-the-Art in Large Language Model-Assisted AI Planning

ReasonPlanner: Enhancing Autonomous Planning in Dynamic Environments with Temporal Knowledge Graphs and LLMs

One STEP at a time: Language Agents are Stepwise Planners

PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning