Abstract: This paper presents a method to convert articulatory movements into speech waveforms using a data-driven approach. In this method, recorded electromagnetic midsagittal articulography (EMA) measurements are converted into both spectral features (i.e., Mel-cepstra) and excitation features (i.e., power, voiced/unvoiced flag, and F0), from which speech waveforms are then reconstructed. To capture the nonlinear and dynamic dependencies between articulatory movements and acoustic signals, this study adopts bidirectional long short-term memory (BLSTM) based recurrent neural networks (RNN) for the articulatory-to-acoustic conversion. Due to the limitations of current data acquisition technology, recorded articulatory movements are inevitably insufficient to completely describe the articulatory configuration during pronunciation. Therefore, this paper proposes to further augment the model input by concatenating EMA vectors with two other representations. First, the posterior probabilities derived from a phoneme classifier are concatenated with EMA features to provide a linguistic description of each frame for acoustic feature prediction. The classifier is trained to determine the phoneme label of each frame based on the observed EMA features. Second, a cascaded prediction strategy is designed to utilize the predicted spectral features as auxiliary input to boost the prediction accuracy of the excitation features. Experimental results show that BLSTM-RNNs achieve better objective and subjective performance than deep neural networks (DNN) and Gaussian mixture models (GMM) in articulatory-to-acoustic conversion. In addition, the results show that the proposed methods for integrating a linguistic representation and utilizing a cascaded prediction strategy further improve the accuracy of acoustic feature prediction.
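The two input augmentations described in the abstract can be illustrated with a minimal sketch. It shows only the data flow: per-frame phoneme posteriors are concatenated with EMA features, and the predicted spectral features are then appended as auxiliary input for excitation prediction. All names and dimensions here (`linear_stub`, 12 EMA channels, 40 phoneme classes, 41 Mel-cepstral coefficients) are assumptions chosen for illustration, and random linear maps stand in for the trained classifier and BLSTM-RNN models, which the abstract does not specify in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
T, EMA_DIM, N_PHONES, MCEP_DIM, EXC_DIM = 50, 12, 40, 41, 3

def softmax(x):
    # Row-wise softmax so each frame's posteriors sum to 1.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def linear_stub(in_dim, out_dim):
    # Placeholder for a trained model: a fixed random linear map.
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: x @ w

# Stand-ins for the phoneme classifier and the two acoustic predictors.
classifier_logits = linear_stub(EMA_DIM, N_PHONES)
spectral_model = linear_stub(EMA_DIM + N_PHONES, MCEP_DIM)
excitation_model = linear_stub(EMA_DIM + N_PHONES + MCEP_DIM, EXC_DIM)

def augment_with_posteriors(ema):
    # Augmentation 1: append frame-level phoneme posteriors to EMA features.
    posteriors = softmax(classifier_logits(ema))
    return np.concatenate([ema, posteriors], axis=1)

def cascaded_predict(ema):
    # Augmentation 2: predict spectral features first, then feed them
    # as auxiliary input when predicting excitation features.
    x = augment_with_posteriors(ema)
    mcep = spectral_model(x)
    excitation = excitation_model(np.concatenate([x, mcep], axis=1))
    return x, mcep, excitation

ema = rng.standard_normal((T, EMA_DIM))  # synthetic EMA frames
x, mcep, excitation = cascaded_predict(ema)
print(x.shape, mcep.shape, excitation.shape)
```

In this sketch the excitation vector has three columns, loosely corresponding to power, the voiced/unvoiced flag, and F0; a real system would treat the voiced/unvoiced flag as a classification output rather than a linear regression target.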

Voice Conversion Using Conditional Restricted Boltzmann Machine

Joint Spectral Distribution Modeling Using Restricted Boltzmann Machines For Voice Conversion

Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training

GMM-based Voice Conversion with Explicit Modelling on Feature Transform

Modeling Spectral Envelopes Using Deep Conditional Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis

Whisper-to-speech Conversion Using Restricted Boltzmann Machine Arrays

Text-Independent Voice Conversion Based on State Mapped Codebook

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

An improved method for voice conversion based on Gaussian mixture model

Using bidirectional associative memories for joint spectral envelope modeling in voice conversion

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Improving the Performance of HMM-based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format

Voice conversion using dynamic inter-frame features

Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis

Voice Conversion Using Generative Trained Deep Neural Networks with Multiple Frame Spectral Envelopes

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

Complex-Valued Restricted Boltzmann Machine for Direct Speech Parameterization from Complex Spectra

Lip Movement Generation Using Restricted Boltzmann Machines For Visual Speech Synthesis

A noise-robust voice conversion method with controllable background sounds