TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su
2024-10-23
Abstract: Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks; even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.
Computation and Language
What problem does this paper attempt to address?
This paper asks whether current language agents can plan in complex real-world scenarios, using travel planning as the test case. Earlier AI agents mostly operated in constrained settings because many of the cognitive foundations required for human-level planning were lacking. With the emergence of large language models (LLMs), a new generation of language agents has shown the ability to use tools and reason, which may supply some of those missing foundations. To test this, the paper proposes TravelPlanner, a new planning benchmark focused on travel planning, a common real-world planning scenario. TravelPlanner provides a rich sandbox environment containing nearly 4 million data records, along with six tools for accessing them. It also includes 1,225 carefully curated planning intents and reference plans, each imposing a different set of constraints. In a comprehensive evaluation, even a state-of-the-art model such as GPT-4 achieves a success rate of only 0.6% on these tasks. Language agents struggle to stay on task, to use the right tools to collect information, and to keep track of multiple constraints. Nevertheless, the paper notes that the mere ability of language agents to attempt such a complex problem is itself non-trivial progress. Overall, the paper explores the capabilities and limitations of language agents on complex real-world planning tasks and offers TravelPlanner as a challenging testbed for future research.
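To make the idea of a success rate over constraint-laden plans concrete, here is a minimal Python sketch. It assumes that a plan counts as a success only when every constraint attached to its query holds; the summary above does not spell out the metric, so this is an illustrative assumption rather than the benchmark's actual evaluation code. The names `Query`, `plan_succeeds`, `success_rate`, `within_budget`, and `covers_days`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A plan is modeled as a plain dict; a constraint is a predicate over it.
Plan = Dict[str, object]
Constraint = Callable[[Plan], bool]


@dataclass
class Query:
    """One planning intent with its own set of constraints (hypothetical)."""
    name: str
    constraints: List[Constraint]


def plan_succeeds(plan: Plan, query: Query) -> bool:
    # Assumption: a plan succeeds only if every constraint holds.
    return all(check(plan) for check in query.constraints)


def success_rate(plans: List[Plan], queries: List[Query]) -> float:
    # Fraction of queries whose plan satisfies all of its constraints.
    if not queries:
        return 0.0
    passed = sum(plan_succeeds(p, q) for p, q in zip(plans, queries))
    return passed / len(queries)


# Illustrative constraints, not taken from the benchmark.
def within_budget(limit: float) -> Constraint:
    return lambda plan: float(plan["total_cost"]) <= limit


def covers_days(days: int) -> Constraint:
    return lambda plan: len(plan["itinerary"]) == days


queries = [
    Query("3-day trip under 1000", [within_budget(1000), covers_days(3)]),
    Query("5-day trip under 2000", [within_budget(2000), covers_days(5)]),
]
plans = [
    {"total_cost": 950.0, "itinerary": ["d1", "d2", "d3"]},
    {"total_cost": 1800.0, "itinerary": ["d1", "d2", "d3", "d4"]},
]

rate = success_rate(plans, queries)
print(f"success rate: {rate:.1%}")
```

In this toy data the second plan stays within budget but covers only four of the five requested days, so it fails as a whole. Under an all-constraints-must-hold rule, a single missed constraint is enough to sink a plan, which is one way a low overall success rate can arise even when most individual constraints are met.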