RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You
2024-02-06
Abstract: Owing to recent advancements, Large Language Models (LLMs) can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, reflecting past experiences in current decision-making processes, an innate human behavior, continues to pose significant challenges. Addressing this, we propose the Retrieval-Augmented Planning (RAP) framework, designed to dynamically leverage past experiences corresponding to the current situation and context, thereby enhancing agents' planning capabilities. RAP distinguishes itself by being versatile: it excels in both text-only and multimodal environments, making it suitable for a wide range of tasks. Empirical evaluations demonstrate RAP's effectiveness, where it achieves SOTA performance in textual scenarios and notably enhances multimodal LLM agents' performance for embodied tasks. These results highlight RAP's potential in advancing the functionality and applicability of LLM agents in complex, real-world applications.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses a gap in LLM-based agents: although LLMs are now deployed as agents in areas such as robotics, gaming, and API integration, they struggle to draw on past experiences when making new decisions. It proposes Retrieval-Augmented Planning (RAP), a framework that stores past experiences in memory and retrieves those relevant to the current situation and context to improve the agent's planning. RAP works in both text-only and multimodal environments. In the reported evaluations, it achieves state-of-the-art performance on textual tasks and notably improves multimodal LLM agents on embodied tasks.
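To make the store-and-retrieve idea concrete, here is a minimal Python sketch of an experience memory that returns the past experiences most similar to the current context and places them in a planning prompt. It is an illustration under stated assumptions, not the paper's implementation: the names `Experience`, `Memory`, `embed`, and `build_prompt` are hypothetical, and the bag-of-words cosine similarity is a stand-in for whatever text or image encoder an actual system would use.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Experience:
    """One past step: the context it happened in and the action taken (hypothetical schema)."""
    context: str
    action: str


def embed(text: str) -> Counter:
    # Assumption: a bag-of-words count vector stands in for a learned embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class Memory:
    items: list = field(default_factory=list)

    def store(self, exp: Experience) -> None:
        self.items.append(exp)

    def retrieve(self, context: str, k: int = 2) -> list:
        # Rank stored experiences by similarity to the current context; keep the top k.
        query = embed(context)
        ranked = sorted(self.items, key=lambda e: cosine(query, embed(e.context)), reverse=True)
        return ranked[:k]


def build_prompt(context: str, memory: Memory, k: int = 2) -> str:
    # Prepend the retrieved experiences so the planner can condition on them.
    lines = ["Relevant past experiences:"]
    for exp in memory.retrieve(context, k):
        lines.append(f"- context: {exp.context} | action: {exp.action}")
    lines.append(f"Current context: {context}")
    lines.append("Next action:")
    return "\n".join(lines)


memory = Memory()
memory.store(Experience("put a clean mug in the kitchen cabinet", "go to sink"))
memory.store(Experience("buy red running shoes online", "search[red running shoes]"))
memory.store(Experience("heat an egg in the kitchen microwave", "open microwave"))

prompt = build_prompt("find a mug in the kitchen", memory, k=2)
print(prompt)
```

In this toy setup the two kitchen experiences rank above the unrelated shopping one, so only they reach the prompt. A multimodal variant would replace `embed` with an encoder that also covers image observations; that choice is outside what this sketch shows.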