Multimodal Segmentation for Vocal Tract Modeling

Rishi Jain, Bohan Yu, Peter Wu, Tejas Prabhune, Gopala Anumanchipalli
2024-06-22
Abstract: Accurate modeling of the vocal tract is necessary to construct articulatory representations for interpretable speech processing and linguistics. However, vocal tract modeling is challenging because many internal articulators are occluded from external motion capture technologies. Real-time magnetic resonance imaging (RT-MRI) allows measuring precise movements of internal articulators during speech, but annotated datasets of MRI are limited in size due to time-consuming and computationally expensive labeling methods. We first present a deep labeling strategy for the RT-MRI video using a vision-only segmentation approach. We then introduce a multimodal algorithm using audio to improve segmentation of vocal articulators. Together, we set a new benchmark for vocal tract modeling in MRI video segmentation and use this to release labels for a 75-speaker RT-MRI dataset, increasing the amount of labeled public RT-MRI data of the vocal tract by over a factor of 9. The code and dataset labels can be found at http://rishiraij.github.io/multimodal-mri-avatar/.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of accurately modeling the vocal tract, which is crucial for interpretable speech processing and linguistic research. External motion capture cannot observe many of the articulators inside the mouth because they are occluded. Real-time magnetic resonance imaging (RT-MRI) can capture these internal movements, but annotated RT-MRI datasets remain small because labeling is time-consuming and computationally expensive. The paper proposes two main methods:
1. A vision-only deep learning strategy that segments RT-MRI video to obtain speaker-independent vocal tract boundaries.
2. A multimodal algorithm that incorporates audio to improve segmentation of the vocal articulators.
With these methods, the authors set a new benchmark for MRI video segmentation and release labels for a 75-speaker RT-MRI dataset, increasing the amount of publicly available labeled RT-MRI data of the vocal tract by more than a factor of 9. The code and dataset labels can be found at rishiraij.github.io/multimodal-mri-avatar/.
Experiments show that, compared with existing baseline methods, the deep learning approach yields higher-quality segmentations for downstream speech tasks (such as speech synthesis) and generalizes better to unseen speakers. In subjective evaluations, participants also tended to prefer the algorithm's output, judging that it more accurately represents the vocal tract movements corresponding to the audio.
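Any method that combines audio with RT-MRI video first has to line the two streams up frame by frame. The following is a minimal sketch of that alignment step followed by a simple feature concatenation. It is not the authors' algorithm: the function names, the 16 kHz sample rate, the 50 fps frame rate, the 32-dimensional image features, and the log-energy audio feature are all assumptions chosen for illustration.

```python
import numpy as np

def frame_aligned_audio(audio, sample_rate, fps, n_frames):
    """Split a mono waveform into one fixed-length window per video frame."""
    win = int(round(sample_rate / fps))  # audio samples per video frame
    needed = win * n_frames
    # Zero-pad (or truncate) so that every video frame gets a full window.
    padded = np.zeros(needed, dtype=np.float32)
    n = min(len(audio), needed)
    padded[:n] = audio[:n]
    return padded.reshape(n_frames, win)

def fuse(image_feats, audio_windows):
    """Concatenate per-frame image features with a per-frame audio feature."""
    # Log-energy per window: a crude stand-in for a learned audio encoder.
    log_energy = np.log(np.mean(audio_windows ** 2, axis=1) + 1e-8)[:, None]
    return np.concatenate([image_feats, log_energy], axis=1)

# Toy data with hypothetical rates and feature sizes (not from the paper).
sample_rate, fps, n_frames = 16000, 50, 10
rng = np.random.default_rng(0)
audio = rng.standard_normal(sample_rate * n_frames // fps).astype(np.float32)
image_feats = rng.standard_normal((n_frames, 32)).astype(np.float32)

windows = frame_aligned_audio(audio, sample_rate, fps, n_frames)
fused = fuse(image_feats, windows)
print(windows.shape, fused.shape)
```

In a real system the fused per-frame vector would feed a segmentation model; concatenation is only one of several possible fusion choices and is used here because it is the simplest to inspect.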