Abstract: This paper presents a data-driven method for converting articulatory movements into speech waveforms. Recorded electromagnetic midsagittal articulography (EMA) measurements are converted into spectral features (i.e., Mel-cepstra) and excitation features (i.e., power, voiced/unvoiced flag, and F0), from which speech waveforms are then reconstructed. To capture the nonlinear and dynamic dependencies between articulatory movements and acoustic signals, this study adopts recurrent neural networks (RNNs) based on bidirectional long short-term memory (BLSTM) for the articulatory-to-acoustic conversion. Because of the limitations of current data acquisition technology, recorded articulatory movements cannot completely describe the articulatory configuration during pronunciation. This paper therefore proposes to augment the model input by combining the EMA vectors with two other representations. First, the posterior probabilities derived from a phoneme classifier are concatenated with the EMA features to provide a linguistic description of each frame for acoustic feature prediction. The classifier is trained to determine the phoneme label of each frame from the observed EMA features. Second, a cascaded prediction strategy uses the predicted spectral features as auxiliary input to improve the prediction accuracy of the excitation features. Experimental results show that BLSTM-RNNs achieve better objective and subjective performance than deep neural networks (DNNs) and Gaussian mixture models (GMMs) in articulatory-to-acoustic conversion. The results also show that integrating the linguistic representation and using the cascaded prediction strategy further improve the accuracy of acoustic feature prediction.
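The data flow described in the abstract (EMA frames, concatenated phoneme posteriors, then a spectral predictor whose output feeds the excitation predictor) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random linear maps built by the hypothetical helper `make_linear` merely stand in for trained BLSTM-RNNs, all feature dimensions are invented for illustration, and feeding the excitation predictor the augmented input together with the predicted Mel-cepstra is an assumption, since the abstract does not specify the exact input of that stage.

```python
import numpy as np

# All dimensions are illustrative assumptions, not values from the paper.
EMA_DIM, N_PHONEMES, MCEP_DIM, EXC_DIM = 12, 40, 41, 3


def softmax(logits):
    """Row-wise softmax, so each frame gets a posterior distribution."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def make_linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained network: a random per-frame linear map."""
    weights = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda frames: frames @ weights


def augment_with_posteriors(ema, phoneme_logits_fn):
    """Concatenate each EMA frame with its phoneme posterior vector."""
    posteriors = softmax(phoneme_logits_fn(ema))
    return np.concatenate([ema, posteriors], axis=1)


def cascaded_predict(ema, phoneme_logits_fn, spectral_fn, excitation_fn):
    """Predict spectral features first, then reuse them as auxiliary input
    when predicting excitation features (power, V/UV flag, F0)."""
    augmented = augment_with_posteriors(ema, phoneme_logits_fn)
    mcep = spectral_fn(augmented)
    # Assumption: the excitation stage sees the augmented input plus the
    # predicted Mel-cepstra.
    excitation = excitation_fn(np.concatenate([augmented, mcep], axis=1))
    return mcep, excitation


def demo(num_frames=5, seed=0):
    """Run the cascade on random frames and return both predictions."""
    rng = np.random.default_rng(seed)
    ema = rng.standard_normal((num_frames, EMA_DIM))
    classifier = make_linear(EMA_DIM, N_PHONEMES, rng)
    spectral = make_linear(EMA_DIM + N_PHONEMES, MCEP_DIM, rng)
    excitation = make_linear(EMA_DIM + N_PHONEMES + MCEP_DIM, EXC_DIM, rng)
    return cascaded_predict(ema, classifier, spectral, excitation)
```

Calling `demo()` returns one Mel-cepstral array and one excitation array, each with one row per input frame. In a real system, each stand-in would be replaced by a separately trained sequence model.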

A hidden trajectory model with bi-directional target filtering: cascaded vs. integrated implementation for phonetic recognition

Statistical Parametric Speech Synthesis Using a Hidden Trajectory Model

Formant Speech Synthesis Based on Trainable Model

Target-filtering model based articulatory movement prediction for articulatory control of HMM-based speech synthesis

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

Modeling DCT Parameterized F0 Trajectory at Intonation Phrase Level with DNN or Decision Tree

Enhancing CTC-based speech recognition with diverse modeling units

Hidden Markov Acoustic Modeling with Bootstrap and Restructuring for Low-Resourced Languages

Phonetic Temporal Neural Model for Language Identification

Improving Factored Hybrid HMM Acoustic Modeling without State Tying

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis

A Novel HTS System Using both Continuous HMMs and Discrete HMMs

Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding

Deep neural networks based speaker modeling at different levels of phonetic granularity

Sign Language Recognition Based on Trajectory Modeling with HMMs

Research on Polynomial-Fitting Speech-Trajectory Model in Chinese Continuous Speech Recognition

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation

A model of speech recognition for hearing-impaired listeners based on deep learning

Phonotactic language recognition based on DNN-HMM acoustic model

Discriminative training of GMM-HMM acoustic model by RPCL type Bayesian Ying-Yang harmony learning

Probabilistic Latent Speaker Analysis for Large Vocabulary Speech Recognition