Abstract: This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). Conventional spectral conversion methods based on the joint density Gaussian mixture model (JDGMM) perform stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to two factors: 1) the inadequacy of the JDGMM in modeling the distribution of spectral features and the non-linear mapping relationship between the source and target speakers, and 2) the loss of spectral detail caused by the use of high-level spectral features such as mel-cepstra. Previously, we proposed using the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose using a DNN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers. The proposed DNN is generatively trained by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, using a Bernoulli BAM (BBAM). The proposed training method therefore takes advantage of both the strong ability of RBMs to model the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions for conversion. Careful comparisons and analysis of the proposed method and several conventional methods are presented in this paper. The subjective results show that the proposed method significantly improves performance in terms of both similarity and naturalness compared to conventional methods.
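
The cascade described in the abstract (a source-speaker RBM, a BBAM linking the two hidden layers, and a target-speaker RBM) can be pictured as a three-stage feed-forward pass at conversion time. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the layer sizes, the random initialization, the sigmoid hidden units, the linear output layer, and all names (`build_network`, `convert_frame`, `src_W`, `bam_W`, `tgt_W`) are hypothetical. In the actual method the weights would come from generative training of the RBMs and the BBAM, which is not shown here.

```python
import math
import random


def sigmoid(z):
    """Logistic activation used for the (assumed) binary hidden units."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return weights @ vec + bias, with weights stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for generatively trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def build_network(dim_env, dim_hidden, seed=0):
    """Assemble the three stages; all sizes and values here are assumptions."""
    rng = random.Random(seed)
    return {
        # Source RBM: spectral envelope -> source hidden units
        "src_W": random_matrix(dim_hidden, dim_env, rng),
        "src_b": [0.0] * dim_hidden,
        # BBAM: source hidden units -> target hidden units
        "bam_W": random_matrix(dim_hidden, dim_hidden, rng),
        "bam_b": [0.0] * dim_hidden,
        # Target RBM: target hidden units -> spectral envelope
        "tgt_W": random_matrix(dim_env, dim_hidden, rng),
        "tgt_b": [0.0] * dim_env,
    }


def convert_frame(net, src_envelope):
    """Map one source spectral-envelope frame to a target-speaker frame."""
    h_src = [sigmoid(z) for z in affine(net["src_W"], net["src_b"], src_envelope)]
    h_tgt = [sigmoid(z) for z in affine(net["bam_W"], net["bam_b"], h_src)]
    # Linear output is an assumption (mean of real-valued visible units).
    return affine(net["tgt_W"], net["tgt_b"], h_tgt)


# Usage: convert a toy 8-bin envelope with a 4-unit hidden layer.
net = build_network(dim_env=8, dim_hidden=4)
converted = convert_frame(net, [0.5] * 8)
```

Stacking the stages this way reflects the idea that each RBM handles the distribution of one speaker's envelopes, while the middle layer carries the cross-speaker mapping.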
