StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models

Nikolai Rozanov, Marek Rei
2024-09-21
Abstract: Planning and acting to solve 'real' tasks using large language models (LLMs) in interactive environments has become a new frontier for AI methods. While recent advances allowed LLMs to interact with online tools, solve robotics tasks and many more, long-range reasoning tasks remain a problem for LLMs. Existing methods to address this issue are very resource intensive and require additional data or human-crafted rules. Instead, we propose a simple method based on few-shot in-context learning alone to enhance 'chain-of-thought' with state-tracking for planning and acting with LLMs. We show that our method establishes the new state-of-the-art on Alfworld for in-context learning methods (+14% over the previous best few-shot in-context learning method) and performs on par with methods that use additional training data and additional tools such as code-execution. We also demonstrate that our enhanced 'chain-of-states' allows the agent to both solve longer horizon problems and to be more efficient in number of steps required to solve a task. We show that our method works across a variety of LLMs for both API-based and open source ones. Finally, we also conduct ablation studies and show that 'chain-of-thoughts' helps state-tracking accuracy, while a json-structure harms overall performance. We open-source our code and annotations at https://github.com/ai-nikolai/StateAct.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty that large language models (LLMs) have with long-horizon tasks in interactive environments. The authors note that although LLMs can now interact with online tools and solve robotics tasks, long-range reasoning remains a problem. Existing remedies are resource intensive and depend on additional data or human-crafted rules, which makes them hard to scale. The authors therefore propose **StateAct**, a method based on few-shot in-context learning alone. It enhances chain-of-thought with a "goal reminder" and "state tracking", so it needs no additional training data or external tools.

#### Main contributions:

1. **Introduced "goal reminder" and "state tracking"**: At every step the model is explicitly reminded of the current goal and tracks the agent's state (such as position and inventory), which helps it plan and reason over long horizons.
2. **Improved the performance in long-horizon reasoning tasks**: In the Alfworld environment, StateAct has a 14% higher success rate than the previous best few-shot in-context learning method, and it performs on par with methods that use additional training data and additional tools such as code execution.
3. **Reduced the number of steps required to complete the task**: Experiments show that StateAct both solves longer-horizon tasks and needs fewer steps to complete a task, which improves efficiency.

#### Specific problem descriptions:

- **Challenges in long-horizon reasoning tasks**: Existing LLMs perform poorly on long-range reasoning tasks, especially without additional resources.
- **Resource-intensive solutions**: Existing solutions usually require additional data or human-crafted rules, which makes them difficult to apply widely.
- **Improving efficiency and accuracy**: The open question is how to improve the performance and efficiency of LLMs on long-horizon reasoning tasks without adding resources.

With these changes, StateAct offers a simple and effective way for LLMs to handle long-horizon reasoning tasks in complex interactive environments.
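
The goal reminder and state tracking described above can be illustrated with a short Python sketch. It is not the authors' implementation: the field names (`goal`, `location`, `inventory`, `thought`, `action`), the `observation:` prefix, and the helpers `format_step`, `parse_step` and `build_prompt` are all hypothetical, chosen only to show how each agent turn might restate the goal and the tracked state before the thought and the action. The sketch uses plain `key: value` lines rather than JSON, since the abstract reports that a JSON structure harms overall performance.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    """One agent turn in a hypothetical StateAct-style trace."""
    goal: str        # goal reminder, restated every turn
    location: str    # tracked state: where the agent is
    inventory: str   # tracked state: what the agent holds
    thought: str     # chain-of-thought for this turn
    action: str      # command sent to the environment


# Assumed field order; the real prompt format may differ.
FIELDS = ("goal", "location", "inventory", "thought", "action")


def format_step(step: AgentStep) -> str:
    """Render a turn as plain 'key: value' lines."""
    return "\n".join(f"{name}: {getattr(step, name)}" for name in FIELDS)


def parse_step(text: str) -> AgentStep:
    """Parse model output back into an AgentStep.

    Lines with unknown keys are ignored; a missing field raises ValueError.
    """
    values = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in FIELDS:
            values[key] = value.strip()
    missing = [name for name in FIELDS if name not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return AgentStep(**values)


def build_prompt(few_shot: str, history: list, observation: str) -> str:
    """Join few-shot examples, past (observation, step) pairs, and the
    newest observation into the text handed to the LLM."""
    parts = [few_shot.strip()]
    for past_observation, step in history:
        parts.append(f"observation: {past_observation}")
        parts.append(format_step(step))
    parts.append(f"observation: {observation}")
    return "\n".join(parts)
```

In use, the model's completion for each turn would be passed through `parse_step`, the `action` field sent to the environment, and the resulting observation appended to the history before the next call to `build_prompt`.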