The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts

Wakana Haijima, Kou Nakakubo, Masahiro Suzuki, Yutaka Matsuo
2024-06-02
Abstract: In recent years, as machine learning, particularly for vision and language understanding, has improved, research in embodied AI has also evolved. VOYAGER is a well-known LLM-based embodied AI that enables autonomous exploration in the Minecraft world, but it has issues such as underutilization of visual data and insufficient functionality as a world model. In this research, the possibility of utilizing visual data and the function of the LLM as a world model were investigated with the aim of improving the performance of embodied AI. The experimental results revealed that an LLM can extract necessary information from visual data, and that using this information improves its performance as a world model. The results also suggested that carefully devised prompts could bring out the LLM's function as a world model.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Insufficient Utilization of Visual Information**:
   - Existing LLM-based embodied AI, such as VOYAGER, does not fully utilize visual data when handling tasks. Specifically, VOYAGER relies on in-game cheat information to obtain environmental data instead of perceiving and understanding the environment through visual input.
   - To improve this, the paper proposes a method that enables the LLM to extract necessary information from visual data and use it to enhance its function as a world model.
2. **Insufficient Functions as a World Model**:
   - When serving as a world model, an LLM lacks an explicit ability to predict future events and environmental changes. Existing world models usually make these predictions through explicit programming, while the internal future-prediction mechanism of an LLM is still unclear.
   - The paper proposes to enhance this function by designing specific prompts that guide the LLM in future prediction and planning.

### Specific Problem Description

- **Utilization of Visual Information**:
  - VOYAGER currently does not collect or input visual information; it uses Minecraft's cheat information to obtain external data for planning. To implement an embodied world model, planning needs to be based on the visual information obtained by the agent.
  - The paper proposes several methods for inputting visual data, including direct use of visual data and indirect use (encoding it as text). Experimental results show that indirect use of visual data, especially in the element-extraction format, helps the agent complete tasks more effectively.
- **Enhancing the World-Model Function of LLM**:
  - The current LLM only suggests the next task based on the given information, and it is unclear whether any internal future prediction is carried out. To make the LLM better serve as a world model, it must be explicitly instructed to make future predictions.
  - The paper modifies the prompts in the automatic curriculum so that the LLM not only suggests the next task but also explicitly predicts the steps needed to reach the goal and the changes in the player's state and the environment after each step is executed.

### Experimental Results

- **Utilization of Visual Information**:
  - Experiments show that indirect use of visual data, especially in the element-extraction format, completes tasks in fewer iterations, indicating that this method reaches milestones and achieves the final goal more quickly.
- **Prediction-Oriented Prompts**:
  - Using prediction-oriented prompts (virtual type) significantly reduces the number of iterations required for each milestone, showing that this method makes task proposals more realistic and efficient.

In conclusion, this paper aims to improve the performance of LLM-based embodied AI in complex environments, specifically on the task of creating a golden pickaxe in Minecraft, by introducing visual information and prediction-oriented prompts.
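
To make the two ideas above concrete, the following is a minimal Python sketch of how visual information encoded as text and a prediction-oriented instruction might be combined into a single prompt. It is not the authors' implementation or their actual prompt wording: the `Observation` class, its fields, the functions `encode_observation` and `build_prediction_prompt`, and the prompt text are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Elements extracted from one frame (hypothetical element-extraction format)."""
    blocks: list[str] = field(default_factory=list)         # visible block types
    entities: list[str] = field(default_factory=list)       # visible mobs or animals
    inventory: dict[str, int] = field(default_factory=dict)  # item name -> count


def encode_observation(obs: Observation) -> str:
    """Render the extracted elements as plain text (indirect use of visual data)."""
    inv = ", ".join(
        f"{name} x{count}" for name, count in sorted(obs.inventory.items())
    ) or "empty"
    return "\n".join([
        f"Visible blocks: {', '.join(obs.blocks) or 'none'}",
        f"Visible entities: {', '.join(obs.entities) or 'none'}",
        f"Inventory: {inv}",
    ])


def build_prediction_prompt(goal: str, obs: Observation) -> str:
    """Ask for the next task plus explicit predictions of future state changes.

    The wording is an assumption for illustration, not the paper's prompt.
    """
    return (
        f"Final goal: {goal}\n\n"
        f"Current observation:\n{encode_observation(obs)}\n\n"
        "Propose the next task. Then list the steps needed to reach the final "
        "goal, and after each step predict how the player's state and the "
        "environment will change."
    )


if __name__ == "__main__":
    example = Observation(
        blocks=["oak_log", "stone"],
        inventory={"wooden_pickaxe": 1},
    )
    print(build_prediction_prompt("Craft a golden pickaxe", example))
```

Keeping the observation encoder separate from the prompt builder mirrors the distinction drawn in the summary: how visual data reaches the LLM and whether the LLM is asked to predict are independent choices that can be varied one at a time.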