Abstract: We propose a Speaker Independent Deep Neural Net (SI-DNN) and Kullback-Leibler Divergence (KLD) based mapping approach to voice conversion that does not require parallel training data. The acoustic difference between source and target speakers is equalized with the SI-DNN via its estimated output posteriors, which serve as a probabilistic mapping from acoustic input frames to the corresponding symbols in the phonetic space. KLD is chosen as the distortion measure for finding an appropriate mapping from each input frame of the source speaker to a frame of the target speaker. The mapped acoustic segments of the target speaker form the construction bases for voice conversion. Depending on whether word transcriptions of the target speaker's training data are available, the approach can be either supervised or unsupervised. In the supervised mode, where adequate training data can be used to train a conventional statistical parametric TTS of the target speaker, each input frame of the source speaker is converted to its nearest sub-phonemic "senone". In the unsupervised mode, the frame is converted to the nearest clustered phonetic centroid or to a raw speech frame, in the minimum KLD sense. The acoustic trajectory of the converted voice is rendered with the maximum probability trajectory generation algorithm. Both objective and subjective measures used for evaluating voice conversion performance show that the new algorithm performs better than the sequential error minimization based DNN baseline trained with parallel training data.
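The minimum-KLD mapping described in the abstract can be illustrated with a minimal sketch. It assumes that posterior vectors over phonetic classes have already been computed for the source frames and for the target speaker's candidate units (senones, cluster centroids, or raw frames). The function name `nearest_target_by_kld`, the use of the symmetric form of the divergence, and the `eps` smoothing constant are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def nearest_target_by_kld(source_post, target_post, eps=1e-10):
    """For each source posterior vector, return the index of the target
    posterior vector with the smallest symmetric KL divergence.

    source_post: array of shape (S, K), one posterior per source frame.
    target_post: array of shape (T, K), one posterior per target unit.
    The symmetric form and eps smoothing are illustrative assumptions.
    """
    # Clip to avoid log(0), then broadcast to shape (S, T, K).
    p = np.clip(np.asarray(source_post, dtype=float), eps, None)[:, None, :]
    q = np.clip(np.asarray(target_post, dtype=float), eps, None)[None, :, :]

    # KL(p||q) + KL(q||p) = sum_k (p_k - q_k) * log(p_k / q_k)
    sym_kld = np.sum((p - q) * (np.log(p) - np.log(q)), axis=-1)

    # Pick the minimum-divergence target unit for each source frame.
    return np.argmin(sym_kld, axis=1)

# Toy posteriors over three phonetic classes (made-up values).
source = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.1, 0.8]])
target = np.array([[0.1, 0.8, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.7, 0.2, 0.1]])
mapping = nearest_target_by_kld(source, target)
```

In this toy case each source frame is paired with the target unit whose posterior has the most similar shape. The acoustic features attached to the selected target units would then be passed to the trajectory generation step.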

Voice conversion using coefficient mapping and neural network

Vowels and Prosody Contribution in Neural Network Based Voice Conversion Algorithm with Noisy Training Data

Voice Conversion Using Deep Neural Network in Super-Frame Feature Space

Voice Conversion Based on Linear Prediction Model with Sinusoidal Excitation

A Parametric Model for Voice Conversion

Voice Conversion Based on Unified Dictionary with Clustered Features Between Non-parallel Corpus

Voice Conversion Using Support Vector Regression

Voice Conversion with SI-DNN and KL Divergence Based Mapping Without Parallel Training Data

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

Spectral Mapping Using Kernel Principal Components Regression for Voice Conversion

Pitch Transformation in Neural Network Based Voice Conversion

High-Quality Voice Conversion Using Spectrogram-Based Wavenet Vocoder

Adversarial Post-Processing of Voice Conversion Against Spoofing Detection

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

Voice Conversion Based On Mapping Formants

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Optimal Transport Maps are Good Voice Converters

The Voice Conversion Method Based on Sparse Convolutive Non-negative Matrix Factorization

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates