Abstract: This paper presents a data-driven method for converting articulatory movements into speech waveforms. In this method, recorded electromagnetic midsagittal articulography (EMA) measurements are converted into both spectral features (i.e., Mel-cepstra) and excitation features (i.e., power, voiced/unvoiced flag, and F0), from which speech waveforms are then reconstructed. To account for the nonlinear and dynamic dependencies between articulatory movements and acoustic signals, this study adopts bidirectional long short-term memory (BLSTM) based recurrent neural networks (RNNs) for the articulatory-to-acoustic conversion. Due to the limitations of current data acquisition technology, recorded articulatory movements cannot completely describe the articulatory configuration during pronunciation. Therefore, this paper proposes to augment the model input by concatenating EMA vectors with two other representations. First, the posterior probabilities derived from a phoneme classifier are concatenated with the EMA features to provide a linguistic description of each frame for acoustic feature prediction. The classifier is trained to determine the phoneme label of each frame from the observed EMA features. Second, a cascaded prediction strategy is designed that uses the predicted spectral features as auxiliary input to improve the prediction accuracy of the excitation features. Experimental results show that BLSTM-RNNs achieve better objective and subjective performance than deep neural networks (DNNs) and Gaussian mixture models (GMMs) in articulatory-to-acoustic conversion. In addition, the results show that both proposed methods, integrating a linguistic representation and using a cascaded prediction strategy, further improve the accuracy of acoustic feature prediction.
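The data flow described in the abstract (EMA frames, phoneme posteriors appended to the input, then spectral prediction feeding excitation prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: random linear maps stand in for the trained BLSTM-RNNs, all dimensions (`EMA_DIM`, `N_PHONES`, `MCEP_DIM`, `EXC_DIM`) and function names are hypothetical, and whether the excitation model also receives the posteriors is an assumption made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; none of these values come from the paper.
T, EMA_DIM, N_PHONES, MCEP_DIM, EXC_DIM = 100, 12, 40, 40, 3


def softmax(x):
    """Row-wise softmax, so each frame gets a probability distribution."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def make_linear(in_dim, out_dim):
    """Stand-in for a trained sequence model (a BLSTM-RNN in the paper)."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: x @ w


# Placeholder models; a real system would train each of these.
phone_classifier = make_linear(EMA_DIM, N_PHONES)
spectral_model = make_linear(EMA_DIM + N_PHONES, MCEP_DIM)
excitation_model = make_linear(EMA_DIM + N_PHONES + MCEP_DIM, EXC_DIM)


def convert(ema):
    """Map EMA frames (T x EMA_DIM) to spectral and excitation features."""
    # Step 1: frame-level phoneme posteriors estimated from EMA alone.
    posteriors = softmax(phone_classifier(ema))
    # Step 2: augment the input by concatenating EMA with the posteriors.
    augmented = np.concatenate([ema, posteriors], axis=1)
    # Step 3: predict spectral features (Mel-cepstra) from the augmented input.
    mcep = spectral_model(augmented)
    # Step 4: cascade; predicted spectra become auxiliary input for the
    # excitation features (power, voiced/unvoiced flag, F0).
    cascaded = np.concatenate([augmented, mcep], axis=1)
    excitation = excitation_model(cascaded)
    return posteriors, mcep, excitation


ema = rng.standard_normal((T, EMA_DIM))
posteriors, mcep, excitation = convert(ema)
```

The point of the sketch is the order of concatenation: each later stage sees everything the earlier stages saw plus their predictions, which is what lets the excitation predictor benefit from the spectral estimate.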

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation.
