DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)

Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, Yi Yang
2024-05-05
Abstract: Recent LLM-driven visual agents mainly focus on solving image-based tasks, which limits their ability to understand dynamic scenes and leaves them far from real-life applications like guiding students in laboratory experiments and identifying their mistakes. Hence, this paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Considering that the video modality better reflects the ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a video agent. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. This structured representation allows for spatial-temporal querying and reasoning by well-designed sub-task tools, resulting in concise intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to access external knowledge and address tasks across different domains. Moreover, a novel LLM-driven planner based on Monte Carlo Tree Search is introduced to explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result's reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios. The code will be released at
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses a limitation of current visual agents driven by large language models (LLMs): they mainly handle static image tasks, which limits their ability to understand dynamic scenes. As a result, they remain far from practical applications such as guiding students through laboratory experiments and identifying their mistakes. The paper therefore explores DoraemonGPT, a comprehensive and conceptually elegant LLM-driven system for understanding dynamic scenes, and exemplifies it as a video agent. Specifically, DoraemonGPT targets three challenges:

1. **Spatiotemporal Reasoning**: Understanding and reasoning about spatiotemporal relationships in videos, which is crucial for task decomposition and decision-making.
2. **Larger Planning Space**: Compared with static images, inferring high-level semantic information along the time dimension is more complex and requires exploring a larger task decomposition space.
3. **Limited Internal Knowledge**: Because the real world keeps changing and LLMs are not trained on proprietary datasets, they cannot encode all the knowledge needed to understand videos. DoraemonGPT therefore integrates plug-and-play tools that access external knowledge to handle tasks in different domains.

With these components, DoraemonGPT handles dynamic spatiotemporal tasks on the three evaluated benchmarks and in several in-the-wild scenarios. Its planner explores multiple candidate solutions, and its external tools extend its expertise with multi-source knowledge.
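
The abstract's planning idea (explore a tree of tool calls, backpropagate each result's reward, and collect several feasible solutions) can be illustrated with a small toy example. The sketch below is not the paper's planner: it is a generic Monte Carlo Tree Search with UCB selection, and random rollouts stand in for LLM-proposed actions. The tool names, the depth limit, and the reward rule are all hypothetical and chosen only to make the example runnable.

```python
import math
import random

# Hypothetical tool names and depth limit (assumptions, not from the paper).
TOOLS = ["query_memory", "search_knowledge", "answer"]
MAX_DEPTH = 3


class Node:
    """One node of the search tree: a partial plan (sequence of tool calls)."""

    def __init__(self, plan, parent=None):
        self.plan = plan            # tuple of tool names chosen so far
        self.parent = parent
        self.children = {}          # tool name -> child Node
        self.visits = 0
        self.total_reward = 0.0

    def is_terminal(self):
        answered = bool(self.plan) and self.plan[-1] == "answer"
        return answered or len(self.plan) >= MAX_DEPTH

    def untried_tools(self):
        return [t for t in TOOLS if t not in self.children]


def toy_reward(plan):
    """Assumed reward: 1.0 if the plan queries memory before answering."""
    if not plan or plan[-1] != "answer":
        return 0.0
    return 1.0 if "query_memory" in plan[:-1] else 0.0


def ucb_child(node, c=1.4):
    """Pick the child with the best mean reward plus exploration bonus."""
    def score(child):
        mean = child.total_reward / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)


def plan_with_mcts(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(plan=())
    solutions = set()               # every rewarded plan seen during search
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded.
        while not node.is_terminal() and not node.untried_tools():
            node = ucb_child(node)
        # Expansion: add one new child (an LLM would propose this action).
        if not node.is_terminal():
            tool = rng.choice(node.untried_tools())
            child = Node(node.plan + (tool,), parent=node)
            node.children[tool] = child
            node = child
        # Simulation: random rollout until the plan ends or hits the limit.
        plan = node.plan
        while not (plan and plan[-1] == "answer") and len(plan) < MAX_DEPTH:
            plan = plan + (rng.choice(TOOLS),)
        reward = toy_reward(plan)
        if reward > 0:
            solutions.add(plan)
        # Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return root, solutions
```

Calling `plan_with_mcts()` returns the root of the search tree and the set of rewarded plans. In a real agent, that set would be the input to a summarization step that merges several solutions into one final answer.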