STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Nicholas Lenzen, Amogh Raut, Andrew Melnik
2024-12-02
Abstract: Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The problem this paper addresses is how to extend the instruction modalities of generative agents so that they can perform tasks specified through new input modalities, such as audio. Specifically, the paper introduces a method that extends the STEVE-1 agent in Minecraft to respond to audio instructions, not just text and visual instructions.

### Main Problems

1. **Expanding Instruction Modalities**: Existing generative agents such as STEVE-1 handle only specific input modalities (text and visual) and cannot directly handle other types of input, such as audio. A method is therefore needed to extend these agents so that they can understand and respond to more types of instructions.
2. **Advantages and Trade-offs of Multimodal Agents**: By introducing new modalities such as audio, the researchers explore the advantages and limitations of different modalities in task execution. For example, some tasks may be better expressed by audio instructions, while others may be better suited to text or visual instructions.

### Solutions

To achieve this, the paper maps new modalities such as audio into the latent goal space of the agent by training a new CLIP model and a mapping network. The specific steps are as follows:

1. **Create a New CLIP Model**: Train a CLIP model over the audio and video modalities so that both are embedded in a shared latent space.
2. **Learn the Mapping Network**: Train a mapping network (the audio prior network) that maps embedding vectors of the new modality into the latent space of the original CLIP model.
3. **Apply to the STEVE-1 Agent**: Use the mapping network to convert audio instructions into goal vectors that the agent can interpret and that guide its behavior.
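The first two steps above can be sketched in code. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the Audio-Video CLIP model is trained with the standard symmetric contrastive objective used by CLIP, and it replaces the audio prior network with a plain linear map. The function names `clip_contrastive_loss` and `audio_to_goal`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(audio_embs, video_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: pair i of audio_embs matches pair i of video_embs, and the
    temperature is a fixed illustrative value rather than a learned one.
    """
    audio = [normalize(v) for v in audio_embs]
    video = [normalize(v) for v in video_embs]
    n = len(audio)
    # logits[i][j] is the scaled cosine similarity of audio i and video j.
    logits = [
        [sum(a * b for a, b in zip(audio[i], video[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-video direction: each row should peak on its diagonal entry.
    loss_a2v = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Video-to-audio direction: the same check over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_v2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2v + loss_v2a)


def audio_to_goal(audio_emb, weights, bias):
    """Hypothetical linear stand-in for the audio prior network.

    Maps an audio embedding to a vector in the policy's latent goal space.
    """
    return [
        sum(w * x for w, x in zip(row, audio_emb)) + b
        for row, b in zip(weights, bias)
    ]


if __name__ == "__main__":
    audio_batch = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched loss:", clip_contrastive_loss(audio_batch, matched))
    print("swapped loss:", clip_contrastive_loss(audio_batch, swapped))
```

In this toy setup, correctly paired audio and video embeddings give a much lower loss than swapped pairs, which is the signal that pulls the two modalities into a shared space. In the third step, the output of a learned mapping (here the placeholder `audio_to_goal`) would be passed to the STEVE-1 policy in place of a text-derived or visual goal embedding.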
### Experimental Results

Experiments show that the audio-conditioned STEVE-1 agent performs at a level comparable to the original agents overall: it does better than the text-conditioned and visual-conditioned versions on some tasks, but worse on some ambiguous or unclear tasks. This illustrates the trade-off between modalities: audio can be advantageous in some cases, but it is limited when expressing complex or specific tasks.

### Summary

The main contribution of the paper is an effective method for extending the instruction modalities of generative agents, together with an analysis of how multimodal agents perform across different tasks. Future research can explore how to apply this method to other domains and other perceptual modalities.