Abstract: This paper presents a data-driven method for converting articulatory movements into speech waveforms. In this method, recorded electromagnetic midsagittal articulography (EMA) measurements are converted into both spectral features (i.e., Mel-cepstra) and excitation features (i.e., power, voiced/unvoiced flag, and F0), from which speech waveforms are then reconstructed. To account for the nonlinear and dynamic dependencies between articulatory movements and acoustic signals, this study adopts bidirectional long short-term memory (BLSTM) based recurrent neural networks (RNNs) for the articulatory-to-acoustic conversion. Due to the limitations of current data acquisition technology, recorded articulatory movements are inevitably insufficient to completely describe the articulatory configuration during pronunciation. Therefore, this paper proposes to augment the model input by concatenating EMA vectors with two other representations. First, the posterior probabilities derived from a phoneme classifier are concatenated with the EMA features to provide a linguistic description of each frame for acoustic feature prediction. The classifier is trained to determine the phoneme label of each frame from the observed EMA features. Second, a cascaded prediction strategy is designed that uses the predicted spectral features as auxiliary input to improve the prediction accuracy of the excitation features. Experimental results show that BLSTM-RNNs achieve better objective and subjective performance than deep neural networks (DNNs) and Gaussian mixture models (GMMs) in articulatory-to-acoustic conversion. In addition, the results show that the proposed methods for integrating a linguistic representation and using a cascaded prediction strategy further improve the accuracy of acoustic feature prediction.
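The data flow described in the abstract (input augmentation with phoneme posteriors, followed by cascaded prediction) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the feature dimensions, the function names, the choice to keep the posteriors in the second-stage input, and above all the fixed random linear maps that stand in for the trained phoneme classifier and BLSTM-RNN predictors. Only the concatenation order follows the abstract.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, used to turn classifier scores into posteriors."""
    z = x - x.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def make_linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained network: a fixed random linear map."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: x @ w


def cascaded_predict(ema, classifier, spectral_model, excitation_model):
    """Frame-wise articulatory-to-acoustic prediction with augmented input."""
    # Augmentation 1: per-frame phoneme posteriors estimated from EMA alone.
    posteriors = softmax(classifier(ema))
    augmented = np.concatenate([ema, posteriors], axis=1)

    # Stage 1: predict spectral features (e.g. Mel-cepstra).
    spectral = spectral_model(augmented)

    # Augmentation 2 (cascade): predicted spectra become auxiliary input
    # for the excitation predictor (power, voiced/unvoiced flag, F0).
    # Keeping the posteriors in this input is an assumption of this sketch.
    cascade_input = np.concatenate([augmented, spectral], axis=1)
    excitation = excitation_model(cascade_input)
    return posteriors, spectral, excitation


# Assumed sizes: frames, EMA dims, phoneme classes, Mel-cepstral dims, excitation dims.
T, EMA_DIM, N_PHONES, MCEP_DIM, EXC_DIM = 50, 12, 40, 41, 3

rng = np.random.default_rng(0)
ema = rng.standard_normal((T, EMA_DIM))  # synthetic EMA frames, not real data

classifier = make_linear(EMA_DIM, N_PHONES, rng)
spectral_model = make_linear(EMA_DIM + N_PHONES, MCEP_DIM, rng)
excitation_model = make_linear(EMA_DIM + N_PHONES + MCEP_DIM, EXC_DIM, rng)

posteriors, spectral, excitation = cascaded_predict(
    ema, classifier, spectral_model, excitation_model
)
print(posteriors.shape, spectral.shape, excitation.shape)
```

In a real system each stand-in would be replaced by a trained sequence model that sees the whole utterance, which is where the bidirectional recurrence matters; the sketch only shows how the input vectors grow from stage to stage.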

Novel Acoustic Modeling with Structured Hidden Dynamics for Speech Coarticulation and Reduction

A hidden trajectory model with bi-directional target filtering: cascaded vs. integrated implementation for phonetic recognition

Ultraviolet irradiation of murine skin alters cluster formation between lymph node dendritic cells and specific T lymphocytes.

The Hidden Markov Model of co-articulation and its application to the continuous speech recognition

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation.

Acoustic Modeling Based On Chinese Phonetics Knowledge

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

Toward a Better Understanding of Deep Neural Network Based Acoustic Modelling: An Empirical Investigation

Deep Recurrent Neural Networks for Acoustic Modelling

Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-Based Speech Synthesis.

A Joint Spectro-Temporal Relational Thinking Based Acoustic Modeling Framework

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks.

Enhancing CTC-based speech recognition with diverse modeling units

An Acoustic Model for English Speech Recognition Based on Deep Learning

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

Acoustic Modeling for Multi-Array Conversational Speech Recognition in the Chime-6 Challenge

Hidden Markov Acoustic Modeling with Bootstrap and Restructuring for Low-Resourced Languages

Statistical Parametric Speech Synthesis Using a Hidden Trajectory Model

Recent Progresses in Deep Learning based Acoustic Models (Updated)