Multimodal Speech Recognition for Language-Guided Embodied Agents

Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason

DOI: https://doi.org/10.21437/Interspeech.2023-2262

2023-10-10

Abstract: Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models. Code: http://github.com/Cylumn/embodied-multimodal-asr

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses how language-guided embodied agents can understand spoken human instructions more accurately when performing household tasks. Existing benchmarks usually assume that embodied agents receive text-based instructions, but in practice these agents must process spoken instructions. Automatic speech recognition (ASR) can transcribe spoken instructions into text, but erroneous transcripts reduce the agents' ability to complete tasks. The paper therefore proposes a multimodal ASR model that uses the accompanying visual context to reduce transcription errors. The main contributions of the paper include:

1. **Multimodal ASR model**: A multimodal ASR model that combines audio and visual information is proposed to improve the accuracy of transcribing spoken instructions in noisy environments.
2. **Synthetic dataset**: A spoken-instruction dataset is synthesized from the ALFRED task completion dataset, with acoustic noise simulated by systematically masking spoken words.
3. **Experimental verification**: The multimodal ASR model is evaluated in different environments and with different speakers, demonstrating that it increases the task completion success rate.

Through these methods, the paper aims to improve the ability of embodied agents to understand and execute spoken instructions in practical application scenarios, especially in noisy environments and with new speakers.
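The word-masking idea and the masked-word recovery measure described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes word-level timestamps are available, replaces each masked word's audio span with silence, and scores recovery by comparing words at the same position in the reference and the transcript. The `Word` tuple format and the function names `mask_words` and `recovery_rate` are hypothetical, introduced only for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical alignment format: (token, start_sec, end_sec)
Word = Tuple[str, float, float]


def mask_words(samples: List[float], words: List[Word], sample_rate: int,
               mask_prob: float, rng: random.Random) -> Tuple[List[float], List[int]]:
    """Silence the audio span of each word selected with probability mask_prob.

    Assumption: masking is modeled as zeroing samples; the paper's exact
    procedure may differ.
    """
    masked = list(samples)  # copy so the clean waveform is left untouched
    masked_idx: List[int] = []
    for i, (_, start, end) in enumerate(words):
        if rng.random() < mask_prob:
            lo = max(0, int(start * sample_rate))
            hi = min(len(masked), int(end * sample_rate))
            masked[lo:hi] = [0.0] * (hi - lo)
            masked_idx.append(i)
    return masked, masked_idx


def recovery_rate(reference: List[str], hypothesis: List[str],
                  masked_idx: List[int]) -> float:
    """Fraction of masked reference words that appear at the same position
    in the hypothesis (a simplification: no edit-distance alignment)."""
    if not masked_idx:
        return 0.0
    hits = sum(1 for i in masked_idx
               if i < len(hypothesis) and hypothesis[i] == reference[i])
    return hits / len(masked_idx)
```

In this sketch, a unimodal and a multimodal ASR model would both transcribe the output of `mask_words`, and `recovery_rate` would be computed for each transcript over the same `masked_idx` to compare how many masked words each model recovers.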
