Multimodal Segmentation for Vocal Tract Modeling

Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli

2024-06-22

Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at http://rishiraij.github.io/multimodal-mri-avatar/.

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the problem of accurately modeling the vocal tract, which is crucial for interpretable speech processing and linguistic research. External motion capture cannot track many of the articulators inside the mouth because they are occluded. Real-time magnetic resonance imaging (RT-MRI) can capture these internal movements, but labeling it is time-consuming and computationally expensive, so annotated datasets remain small.

The paper proposes two main methods:

1. A deep learning, vision-only strategy for segmenting RT-MRI video to obtain speaker-independent vocal tract boundaries.
2. A multimodal algorithm that incorporates audio to improve segmentation of the vocal articulators.

Together, these methods set a new benchmark for vocal tract modeling in MRI video segmentation. The authors use them to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by more than a factor of 9. The code and dataset labels can be found at rishiraij.github.io/multimodal-mri-avatar/.

Experiments show that the deep learning approach yields higher-quality segmentation for downstream speech tasks (such as speech synthesis) than existing baseline methods and generalizes better to unseen speakers. In subjective evaluations, participants tended to prefer the algorithm's output, judging that it more accurately represents the vocal tract movements corresponding to the audio.
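To make the idea of combining audio with video concrete, here is a minimal sketch of one generic preprocessing step: pooling audio features onto the video frame grid and concatenating them with per-frame image features. This is an illustration only, not the authors' algorithm; the summary does not describe their fusion mechanism. The function names (`align_audio_to_video`, `fuse_features`), the frame rates, and the feature shapes are all hypothetical.

```python
import numpy as np


def align_audio_to_video(audio_feats, audio_rate, n_video_frames, video_rate):
    """Average the audio feature frames that fall inside each video frame's window.

    audio_feats: array of shape (n_audio_frames, n_audio_dims)
    audio_rate, video_rate: frames per second (hypothetical values, not from the paper)
    """
    audio_feats = np.asarray(audio_feats, dtype=float)
    aligned = np.zeros((n_video_frames, audio_feats.shape[1]))
    for i in range(n_video_frames):
        # Index range of audio frames covered by video frame i.
        start = int(i * audio_rate // video_rate)
        end = int((i + 1) * audio_rate // video_rate)
        # Clamp so every video frame sees at least one audio frame.
        start = min(start, len(audio_feats) - 1)
        end = min(max(end, start + 1), len(audio_feats))
        aligned[i] = audio_feats[start:end].mean(axis=0)
    return aligned


def fuse_features(video_feats, aligned_audio):
    """Concatenate per-frame video and audio features along the feature axis."""
    return np.concatenate([video_feats, aligned_audio], axis=1)


# Toy example: 8 audio frames at 4 Hz paired with 4 video frames at 2 Hz.
audio = np.arange(8).reshape(8, 1)
video = np.ones((4, 3))
fused = fuse_features(video, align_audio_to_video(audio, 4, 4, 2))
print(fused.shape)
```

A downstream segmentation model could then consume the fused per-frame features; how the paper's model actually uses audio should be checked against the paper and its released code.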
