Abstract: This paper presents a method to convert articulatory movements into speech waveforms using a data-driven approach. In this method, recorded electromagnetic midsagittal articulography (EMA) measurements are converted into both spectral features (i.e., Mel-cepstra) and excitation features (i.e., power, voiced/unvoiced flag, and F0), from which speech waveforms are then reconstructed. Because the dependencies between articulatory movements and acoustic signals are nonlinear and dynamic, this study adopts bidirectional long short-term memory (BLSTM) based recurrent neural networks (RNNs) for the articulatory-to-acoustic conversion. Due to the limitations of current data acquisition technology, recorded articulatory movements are inevitably insufficient to completely describe the articulatory configuration during pronunciation. Therefore, this paper proposes to further augment the model input by concatenating EMA vectors with two other representations. First, the posterior probabilities derived from a phoneme classifier are concatenated with EMA features to provide a linguistic description of each frame for acoustic feature prediction. The classifier is trained to determine the phoneme label of each frame based on the observed EMA features. Second, a cascaded prediction strategy is designed to utilize the predicted spectral features as auxiliary input to boost the prediction accuracy of the excitation features. Experimental results show that BLSTM-RNNs achieve better objective and subjective performance than deep neural networks (DNNs) and Gaussian mixture models (GMMs) in articulatory-to-acoustic conversion. In addition, the results show that the proposed methods for integrating linguistic representation and utilizing a cascaded prediction strategy can further improve the accuracy of acoustic feature prediction.
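The data flow described in the abstract (augmenting EMA frames with phoneme posteriors, then feeding predicted spectral features into the excitation predictor) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the feature dimensions, and the stand-in "models" are all hypothetical, and the trained phoneme classifier and BLSTM-RNNs are replaced by trivial placeholder callables. Whether the excitation predictor also receives the phoneme posteriors is not stated in the abstract and is assumed here.

```python
from typing import Callable, List, Tuple

Frame = List[float]
Sequence = List[Frame]
Model = Callable[[Sequence], Sequence]  # maps a frame sequence to a frame sequence


def concat_frames(a: Sequence, b: Sequence) -> Sequence:
    """Concatenate two frame-aligned sequences along the feature axis."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    return [fa + fb for fa, fb in zip(a, b)]


def cascaded_conversion(
    ema: Sequence,
    phoneme_classifier: Model,
    spectral_model: Model,
    excitation_model: Model,
) -> Tuple[Sequence, Sequence]:
    """Hypothetical pipeline: EMA -> (spectral, excitation) features."""
    # Augmentation 1: per-frame phoneme posteriors estimated from EMA.
    posteriors = phoneme_classifier(ema)
    augmented = concat_frames(ema, posteriors)

    # Stage 1: predict spectral features (e.g., Mel-cepstra).
    spectral = spectral_model(augmented)

    # Stage 2 (cascade): predicted spectral features become auxiliary input.
    # Assumption: the posteriors are kept in the excitation input as well.
    excitation = excitation_model(concat_frames(augmented, spectral))
    return spectral, excitation


# Placeholder stand-ins for trained models (hypothetical dimensions).
def uniform_posteriors(seq: Sequence) -> Sequence:
    return [[1.0 / 3.0] * 3 for _ in seq]  # 3 phoneme classes


def dummy_spectral(seq: Sequence) -> Sequence:
    return [[0.0, 0.0] for _ in seq]  # 2 spectral coefficients


def dummy_excitation(seq: Sequence) -> Sequence:
    return [[float(len(frame))] for frame in seq]  # reports its input dimension


if __name__ == "__main__":
    ema_frames = [[0.1, 0.2], [0.3, 0.4]]  # 2 frames, 2 EMA channels
    spec, exc = cascaded_conversion(
        ema_frames, uniform_posteriors, dummy_spectral, dummy_excitation
    )
    print(len(spec), exc)
```

In this sketch the excitation stand-in simply reports the dimensionality of its input, which makes it easy to see that the cascade widens the input from the EMA channels alone to EMA plus posteriors plus predicted spectral features.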

A novel neural-based pronunciation modeling method for robust speech recognition.

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation.

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

Phonotactic language recognition based on DNN-HMM acoustic model

Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition

Improving Pronunciation for Non-Native Speakers Using Neural Networks

Improve low-resource non-native mispronunciation detection with native speech by articulatory-based tandem feature

AI Recognition Method of Pronunciation Errors in Oral English Speech with the Help of Big Data for Personalized Learning

Recurrent Neural Network Based Language Model Adaptation for Accent Mandarin Speech.

Deep Neural Network-based Mixed Speech Recognition Technology for Chinese and English

A hidden Markov optimization model for processing and recognition of English speech feature signals

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

A Deep Recurrent Approach for Acoustic-to-articulatory Inversion.

Using Phoneme Representations to Build Predictive Models Robust to ASR Errors

An Acoustic Model for English Speech Recognition Based on Deep Learning

A two-stage mispronunciation detection approach for computer-assisted pronunciation training

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers

English Speech Recognition System Model Based on Computer-Aided Function and Neural Network Algorithm