Abstract: This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). Conventional spectral conversion methods based on the joint density Gaussian mixture model (JDGMM) perform stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to two factors: 1) the inadequacy of the JDGMM in modeling the distribution of spectral features and the non-linear mapping relationship between the source and target speakers, and 2) the loss of spectral detail caused by the use of high-level spectral features such as mel-cepstra. Previously, we proposed using the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose using a DNN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers. The proposed DNN is generatively trained by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, using a Bernoulli BAM (BBAM). The proposed training method therefore takes advantage of both the strong ability of RBMs to model the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions for conversion. Careful comparisons and analysis of the proposed method against several conventional methods are presented. Subjective evaluation results show that the proposed method significantly improves performance in terms of both similarity and naturalness compared with conventional methods.
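
The cascaded structure described in the abstract can be illustrated with a minimal sketch of the conversion-time forward pass, written in Python with NumPy. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class name `CascadedConverter`, the layer sizes, the use of Gaussian (linear) visible units for the spectral envelopes, the mean-field (deterministic sigmoid) propagation, and the randomly initialized weights that stand in for parameters the paper obtains by generative training.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used for the Bernoulli hidden units."""
    return 1.0 / (1.0 + np.exp(-x))


class CascadedConverter:
    """Hypothetical forward pass: source RBM -> BBAM -> target RBM.

    All weights are random placeholders. In the paper they would come from
    training the two RBMs and the connecting BBAM, which is not shown here.
    """

    def __init__(self, dim_env, dim_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Source-speaker RBM: spectral envelope -> binary hidden units.
        self.W_src = rng.normal(0.0, 0.01, (dim_env, dim_hidden))
        self.b_src = np.zeros(dim_hidden)
        # BBAM: source hidden units -> target hidden units.
        self.W_bam = rng.normal(0.0, 0.01, (dim_hidden, dim_hidden))
        self.b_bam = np.zeros(dim_hidden)
        # Target-speaker RBM, used top-down: hidden units -> spectral envelope.
        self.W_tgt = rng.normal(0.0, 0.01, (dim_env, dim_hidden))
        self.c_tgt = np.zeros(dim_env)

    def convert(self, x):
        """Map a batch of source envelopes (frames x dim_env) to target envelopes."""
        h_src = sigmoid(x @ self.W_src + self.b_src)      # encode with source RBM
        h_tgt = sigmoid(h_src @ self.W_bam + self.b_bam)  # map hidden states via BBAM
        # Assumed Gaussian visible units: the decoded mean is linear in h_tgt.
        return h_tgt @ self.W_tgt.T + self.c_tgt


# Usage: convert four frames of an (assumed) 513-bin spectral envelope.
model = CascadedConverter(dim_env=513, dim_hidden=256)
frames = np.zeros((4, 513))
converted = model.convert(frames)
```

The sketch only shows how the three pretrained components would compose into one feed-forward network. It does not reproduce the training procedure or any result reported in the paper.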
