Abstract: Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.

What problem does this paper attempt to address?

The paper addresses how to extend the instruction modalities of generative agents so that they can carry out tasks specified through new input modalities, such as audio. Specifically, it introduces a method that extends the STEVE-1 agent in Minecraft to respond to audio instructions in addition to text and visual instructions.

### Main Problems

1. **Expanding Instruction Modalities**: Existing generative agents such as STEVE-1 handle only specific input modalities (text and visual) and cannot directly handle other types of input, such as audio. A method is therefore needed to extend these agents so that they can understand and respond to more types of instructions.
2. **Advantages and Trade-offs of Multimodal Agents**: By introducing a new modality such as audio, the authors explore the advantages and limitations of different modalities in task execution. For example, some tasks may be better expressed through audio instructions, while others may be better suited to text or visual instructions.

### Solutions

To achieve this goal, the paper proposes mapping new modalities such as audio into the agent's latent goal space by training a new CLIP model. The specific steps are as follows:

1. **Create a New CLIP Model**: Train an Audio-Video CLIP model that embeds the audio and video modalities in a shared latent space.
2. **Learn the Mapping Network**: Train a mapping network (the audio prior) that maps embedding vectors of the new modality into the latent space of the original CLIP model.
3. **Apply to the STEVE-1 Agent**: Use the mapping network to convert audio instructions into goal vectors that the STEVE-1 policy can be conditioned on to guide its behavior.
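
The first two steps above can be illustrated with a small, self-contained sketch. It is not the paper's implementation: it assumes the Audio-Video CLIP model is trained with the standard CLIP-style symmetric contrastive loss, and it replaces the audio prior with a deliberately simple linear map fitted by squared error. The names `symmetric_contrastive_loss`, `LinearPrior`, and `train_prior`, the temperature value, and the toy data are all hypothetical choices made for illustration.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (CLIP-style embeddings are compared by cosine)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def symmetric_contrastive_loss(audio_embs, video_embs, temperature=0.07):
    """CLIP-style symmetric contrastive loss; row i of each batch is a matched pair."""
    a = [l2_normalize(v) for v in audio_embs]
    b = [l2_normalize(v) for v in video_embs]
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)
            log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


class LinearPrior:
    """Hypothetical stand-in for the audio prior: goal = W @ audio_embedding."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]

    def sgd_step(self, x, target, lr=0.1):
        """One gradient step on the squared error; returns the pre-update loss."""
        err = [p - t for p, t in zip(self(x), target)]
        for row, e in zip(self.w, err):
            for k in range(len(row)):
                row[k] -= lr * 2.0 * e * x[k]
        return sum(e * e for e in err)


def train_prior(prior, pairs, epochs=200, lr=0.1):
    """Fit the prior on (audio_embedding, goal_embedding) pairs; returns mean loss per epoch."""
    history = []
    for _ in range(epochs):
        total = sum(prior.sgd_step(x, g, lr) for x, g in pairs)
        history.append(total / len(pairs))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: random unit "audio" embeddings; the "goal" is their first three dimensions.
    audio = [l2_normalize([rng.gauss(0, 1) for _ in range(4)]) for _ in range(8)]
    goals = [x[:3] for x in audio]
    prior = LinearPrior(in_dim=4, out_dim=3)
    history = train_prior(prior, list(zip(audio, goals)))
    print("prior loss: first epoch %.4f, last epoch %.6f" % (history[0], history[-1]))
    print("contrastive loss on matched pairs:", symmetric_contrastive_loss(audio, audio))
```

In step 3, the output of the trained prior would take the place of the text- or video-derived goal embedding that the policy is normally conditioned on. A learned, possibly stochastic prior could stand in for the linear map here without changing the overall flow.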

### Experimental Results

Experiments show that the audio-conditioned STEVE-1 agent performs comparably to the original text-conditioned and visually conditioned versions, and better on some tasks, but worse on tasks where the audio cue is ambiguous. This illustrates the trade-off between modalities: audio can be advantageous in some cases but is limited when expressing complex or highly specific tasks.

### Summary

The main contribution of the paper is an effective method for extending the instruction modalities of generative agents, together with an analysis of how multimodal agents perform across different tasks. Future research can explore applying this method to other domains and other perceptual modalities.

STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

See and Think: Embodied Agent in Virtual Environment

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

Reinforcement Learning Friendly Vision-Language Model for Minecraft

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Odyssey: Empowering Minecraft Agents with Open-World Skills

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments Via Large Language Models with Text-based Knowledge and Memory

MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Collaborative Learning

Plan4MC: Skill Reinforcement Learning and Planning for Open-World Minecraft Tasks

Collaborative Quest Completion with LLM-driven Non-Player Characters in Minecraft

Creative Agents: Empowering Agents with Imagination for Creative Tasks

Improving Agent Interactions in Virtual Environments with Language Models

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Large Language Models as Minecraft Agents

Scaling Instructable Agents Across Many Simulated Worlds

Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning