Abstract: In recent years, as machine learning has improved, particularly for vision and language understanding, research in embodied AI has also evolved. VOYAGER is a well-known LLM-based embodied AI that enables autonomous exploration in the Minecraft world, but it has issues such as underutilization of visual data and insufficient functionality as a world model. In this research, the possibility of utilizing visual data and the function of the LLM as a world model were investigated with the aim of improving the performance of embodied AI. The experimental results revealed that the LLM can extract necessary information from visual data, and that utilizing this information improves its performance as a world model. The results also suggested that carefully devised prompts can bring out the LLM's function as a world model.

What problem does this paper attempt to address?

The problems that this paper attempts to solve mainly focus on two aspects:

1. **Insufficient Utilization of Visual Information**:
   - Existing large-language-model-based (LLM-based) embodied AI, such as VOYAGER, fails to fully utilize visual data when handling tasks. Specifically, VOYAGER relies on in-game cheat information to obtain environmental data instead of perceiving and understanding the environment through visual input.
   - To improve this, the paper proposes a method that enables the LLM to extract necessary information from visual data and use it to enhance the functions of the world model.
2. **Insufficient Functions as a World Model**:
   - When serving as a world model, the LLM lacks the explicit ability to predict future events and environmental changes. Existing world models usually make these predictions through explicit programming, while the internal future-prediction mechanism in an LLM is still unclear.
   - The paper proposes to enhance its function as a world model by designing specific prompts that guide the LLM in future prediction and planning.

### Specific Problem Description

- **Utilization of Visual Information**:
  - VOYAGER currently does not collect or input visual information but uses Minecraft's cheat information to obtain external data for planning. To implement an embodied world model, planning needs to be based on the visual information obtained by the agent.
  - The paper proposes several methods for inputting visual data, including direct use of visual data and indirect use (by encoding it as text). Experimental results show that indirect use of visual data (especially in the element-extraction format) helps the agent complete tasks more effectively.
- **Enhancing the World-Model Function of LLM**:
  - The current LLM only suggests the next task based on the given information, and it is unclear whether any internal future prediction is carried out. To make the LLM better serve as a world model, it is necessary to explicitly instruct it to make future predictions.
  - The paper modifies the prompts in the automatic curriculum so that the LLM not only suggests the next task but also explicitly predicts the steps to reach the goal and the changes in the player's state and the environment after each step is executed.

### Experimental Results

- **Utilization of Visual Information**:
  - Experiments show that indirect use of visual data (especially in the element-extraction format) completes tasks in fewer iterations, indicating that this method reaches milestones and achieves the final goal more quickly.
- **Prediction-Oriented Prompts**:
  - Using prediction-oriented prompts (virtual type) significantly reduces the number of iterations required for each milestone, showing that this method can make task proposals more realistic and efficient.

In conclusion, this paper aims to improve the performance of LLM-based embodied AI in complex environments, especially in the task of creating a golden pickaxe in the Minecraft environment, by introducing visual information and prediction-oriented prompts.
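The two ideas summarized above, indirect use of visual data as text and a prompt that asks for per-step predictions, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `SceneElement` schema, the `encode_elements` and `build_prediction_prompt` helpers, and the prompt wording are hypothetical and are not taken from the paper or from VOYAGER's code. The sketch also assumes that some vision model has already produced the list of elements.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SceneElement:
    """One item reported by a vision model (hypothetical schema)."""
    name: str
    count: int
    distance: str  # coarse bucket such as "near" or "far"


def encode_elements(elements: list[SceneElement]) -> str:
    """Encode extracted scene elements as a compact text block for the LLM."""
    if not elements:
        return "Visible elements: none"
    lines = [f"- {e.name} x{e.count} ({e.distance})" for e in elements]
    return "Visible elements:\n" + "\n".join(lines)


def build_prediction_prompt(goal: str, inventory: dict[str, int], scene_text: str) -> str:
    """Build a prompt that asks for a plan plus predicted state changes.

    The wording is illustrative only, not the prompt used in the paper.
    """
    # Sort inventory keys so the prompt is deterministic across runs.
    inv = ", ".join(f"{k}: {v}" for k, v in sorted(inventory.items())) or "empty"
    return (
        f"Final goal: {goal}\n"
        f"Inventory: {inv}\n"
        f"{scene_text}\n\n"
        "List the steps needed to reach the final goal. For each step, predict "
        "the player's state and the environment after the step is executed. "
        "Then propose only the next task."
    )


if __name__ == "__main__":
    scene = encode_elements([SceneElement("oak_log", 3, "near")])
    print(build_prediction_prompt("craft a golden pickaxe", {"stick": 2}, scene))
```

Keeping the scene encoding separate from the prompt template mirrors the split in the summary: the first function stands in for the element-extraction format, the second for the prediction-oriented instruction added to the automatic curriculum.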

The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts
