TaskLAMA: Probing the Complex Task Understanding of Language Models

Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran
DOI: https://doi.org/10.48550/arXiv.2308.15299
2023-08-29
Abstract: Structured Complex Task Decomposition (SCTD) is the problem of breaking down a complex real-world task (such as planning a wedding) into a directed acyclic graph over individual steps that contribute to achieving the task, with edges specifying temporal dependencies between them. SCTD is an important component of assistive planning tools, and a challenge for commonsense reasoning systems. We probe how accurately SCTD can be done with the knowledge extracted from Large Language Models (LLMs). We introduce a high-quality human-annotated dataset for this problem and novel metrics to fairly assess performance of LLMs against several baselines. Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. We also propose a number of approaches to further improve their performance, with a relative improvement of 7% to 37% over the base model. However, we find that LLMs still struggle to predict pairwise temporal dependencies, which reveals a gap in their understanding of complex tasks.
Computation and Language
What problem does this paper attempt to address?
This paper asks how well the knowledge in large language models (LLMs) can be used for Structured Complex Task Decomposition (SCTD). The goal of SCTD is to decompose a complex real-world task (such as planning a wedding) into a directed acyclic graph (DAG), where nodes represent the steps required to complete the task and edges represent the temporal dependencies between those steps.

The main contributions of the paper are:
1. Creating TaskLAMA, a high-quality human-annotated dataset for studying the understanding of complex real-world tasks.
2. Developing new evaluation metrics that fairly measure LLM performance on SCTD and cannot be inflated simply by adding duplicate sub-steps.
3. Proposing several LLM-based methods to improve SCTD performance and comparing them with baselines that do not use LLMs.
4. Conducting comprehensive experiments showing that LLMs are effective at decomposing complex tasks into individual steps but still struggle to predict pairwise temporal dependencies between steps.

Together, these results demonstrate the potential of LLMs for SCTD while exposing their limitations in understanding the temporal dependencies of complex tasks. This points to a direction for future research: improving how well LLMs understand complex tasks.
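To make the DAG formulation described above concrete, here is a minimal Python sketch of one way such a decomposition could be represented and checked. The step names and the `valid_order` helper are illustrative assumptions; they are not taken from the TaskLAMA dataset or from the paper's code.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical steps for the "planning a wedding" example (not from the dataset).
# Each step maps to the set of steps that must be completed before it.
depends_on = {
    "set a budget": set(),
    "choose a date": set(),
    "book a venue": {"set a budget", "choose a date"},
    "send invitations": {"book a venue"},
}


def valid_order(depends_on):
    """Return one step order that respects every dependency,
    or None if the dependencies contain a cycle (i.e. not a DAG)."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


order = valid_order(depends_on)
print(order)
```

Any order this returns places each step after all of its prerequisites. A dependency set with a cycle, such as two steps that each require the other, yields `None`, which signals that it is not a valid decomposition under the DAG definition.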