Abstract: This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). Conventional spectral conversion methods based on the joint density Gaussian mixture model (JDGMM) perform stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to two factors: 1) the inadequacy of the JDGMM in modeling both the distribution of spectral features and the non-linear mapping relationship between the source and target speakers, and 2) the loss of spectral detail caused by the use of high-level spectral features such as mel-cepstra. We previously proposed the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose to use a DNN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers. The proposed DNN is generatively trained by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, using a Bernoulli BAM (BBAM). The proposed training method therefore takes advantage of both the strength of RBMs in modeling the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions for conversion. A careful comparison and analysis of the proposed method and several conventional methods are presented. Subjective evaluation results show that the proposed method significantly improves both similarity and naturalness compared with the conventional methods.
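
The cascaded structure described in the abstract can be illustrated with a minimal sketch. It shows only the forward pass of a network assembled from a source-side RBM, a BBAM connecting the two hidden layers, and a target-side RBM. Everything specific here is an assumption made for illustration and is not taken from the paper: the layer sizes (513 and 64), the random initialisation standing in for generatively pretrained parameters, the deterministic mean-field propagation, the linear output layer (Gaussian visible units with unit variance), and all function and parameter names.

```python
import numpy as np


def sigmoid(a):
    """Logistic function used for the Bernoulli hidden units."""
    return 1.0 / (1.0 + np.exp(-a))


def init_params(n_visible, n_hidden, rng):
    """Random stand-ins for parameters that generative pretraining would supply.

    Hypothetical layout: W_src/b_src come from the source RBM, W_bam/b_bam from
    the BBAM linking the two hidden layers, W_tgt/b_tgt from the target RBM.
    """
    return {
        "W_src": 0.01 * rng.standard_normal((n_visible, n_hidden)),
        "b_src": np.zeros(n_hidden),
        "W_bam": 0.01 * rng.standard_normal((n_hidden, n_hidden)),
        "b_bam": np.zeros(n_hidden),
        "W_tgt": 0.01 * rng.standard_normal((n_visible, n_hidden)),
        "b_tgt": np.zeros(n_visible),
    }


def convert(x, p):
    """Map source spectral envelopes to target ones through the cascade."""
    # Source RBM, bottom-up: visible envelope -> source hidden probabilities.
    h_src = sigmoid(x @ p["W_src"] + p["b_src"])
    # BBAM: source hidden layer -> target hidden probabilities.
    h_tgt = sigmoid(h_src @ p["W_bam"] + p["b_bam"])
    # Target RBM, top-down: hidden layer -> mean of the Gaussian visible units.
    return h_tgt @ p["W_tgt"].T + p["b_tgt"]


rng = np.random.default_rng(0)
params = init_params(n_visible=513, n_hidden=64, rng=rng)
frames = rng.standard_normal((10, 513))  # 10 synthetic frames, not real speech
converted = convert(frames, params)
```

In this sketch the three pretrained components simply become consecutive layers of one feed-forward network, which is why the mapping is global rather than a mixture of local transforms. Training of the RBMs and the BBAM is omitted entirely.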

GMM-based Voice Conversion with Explicit Modelling on Feature Transform

An improved method for voice conversion based on Gaussian mixture model

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

Voice conversion using dynamic inter-frame features

Joint Spectral Distribution Modeling Using Restricted Boltzmann Machines For Voice Conversion

Non-parallel training for voice conversion based on FT-GMM

A hybrid method to convert acoustic features for voice conversion

Voice Conversion with Smoothed GMM and MAP Adaptation

Voice Conversion Based on Speaker Independent Model

Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training

A Parametric Model for Voice Conversion

An Improved Spectral And Prosodic Transformation Method In STRAIGHT-Based Voice Conversion

A hybrid GMM and codebook mapping method for spectral conversion

Voice Conversion Using Conditional Restricted Boltzmann Machine

Improving the Performance of HMM-based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format

Using bidirectional associative memories for joint spectral envelope modeling in voice conversion

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

Text-Independent Voice Conversion Based on State Mapped Codebook

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion