Abstract: We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer, which exposes articulatory parameters (e.g., tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing the specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two methodologies. The first was a self-supervised variational autoencoder, trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the "projector," which decodes the voice synthesizer's parameters. The second methodology used two pretrained models, EnCodec and Wav2Vec. These eliminate the need to train the encoder from scratch, allowing us to focus on training the projector network, and let us explore the potential of existing models for acoustic-to-articulatory inversion. Reusing the pretrained models significantly simplified the data processing pipeline, increasing efficiency and reducing computational overhead. The primary goal of our project was to demonstrate that these neural architectures can effectively encapsulate both acoustic and articulatory features. This prediction-based approach is much faster than traditional methods that optimize synthesizer parameters against acoustic features. We validated our models by predicting six different parameters and evaluating them with objective metrics and the ViSQOL subjective-equivalent metric, using both synthesizer- and human-generated sounds. The results show that the predicted parameters can generate human-like vowel sounds when input into the synthesizer. We provide the dataset, code, and detailed findings to support future research in this field.
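To make the role of the projector concrete, the following is a minimal sketch of a small network that maps a latent embedding to six normalized synthesizer parameters. It is an illustration under stated assumptions, not the authors' implementation: the two-layer structure, the ReLU and sigmoid activations, the hidden size, the latent dimension of 32, and the names `Projector`, `init_layer`, and `mse` are all hypothetical, and the random latent vector merely stands in for an embedding from the autoencoder bottleneck or from a pretrained encoder such as EnCodec or Wav2Vec.

```python
import math
import random

NUM_PARAMS = 6  # the abstract reports six predicted synthesizer parameters


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply a dense layer: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class Projector:
    """Hypothetical two-layer MLP: latent embedding -> normalized parameters."""

    def __init__(self, latent_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.hidden = init_layer(latent_dim, hidden_dim, rng)
        self.out = init_layer(hidden_dim, NUM_PARAMS, rng)

    def __call__(self, latent):
        h = [max(0.0, z) for z in linear(latent, self.hidden)]  # ReLU
        # Sigmoid keeps every output in (0, 1); mapping to the synthesizer's
        # actual parameter ranges is left out because the abstract does not
        # state those ranges.
        return [sigmoid(z) for z in linear(h, self.out)]


def mse(pred, target):
    """Mean squared error between predicted and reference parameters."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Stand-in latent vector (assumption: a 32-dimensional embedding).
rng = random.Random(1)
latent = [rng.gauss(0.0, 1.0) for _ in range(32)]

projector = Projector(latent_dim=32)
params = projector(latent)
print(len(params))
```

In a training setup of this kind, the projector's weights would be fitted by minimizing a loss such as `mse` between its outputs and the known parameters of synthesizer-generated sounds, while a pretrained encoder could stay fixed.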
