Abstract: Voice conversion aims to convert source speech into a target voice using recordings of the target speaker as a reference. Newer models are producing increasingly realistic output. But what happens when models are fed non-standard data, such as speech from a user with a speech impairment? We investigate how a recent voice conversion model performs on non-standard downstream voice conversion tasks. We use a simple but robust approach called k-nearest neighbors voice conversion (kNN-VC). We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion. The last of these involves converting to a target voice specified through a text description, e.g. "a young man with a high-pitched voice". Compared to an established baseline, we find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion. Results are more mixed for the musical instrument and text-to-voice conversion tasks. For example, kNN-VC works well on some instruments like drums but not on others. Nevertheless, this shows that voice conversion models, and kNN-VC in particular, are increasingly applicable in a range of non-standard downstream tasks. But there are still limitations when samples are very far from the training distribution. Code, samples, trained models: https://rf5.github.io/sacair2023-knnvc-demo/

Text-Independent Voice Conversion Based on State Mapped Codebook

A novel voice conversion system based on codebook mapping with phoneme-tied weighting.

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer.

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Supervisory Data Alignment for Text-Independent Voice Conversion

Voice Conversion Based on Speaker Independent Model

An improved method for voice conversion based on Gaussian mixture model

Towards General-Purpose Text-Instruction-Guided Voice Conversion

A hybrid GMM and codebook mapping method for spectral conversion

Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

A hybrid method to convert acoustic features for voice conversion

GMM-based Voice Conversion with Explicit Modelling on Feature Transform

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

Voice Conversion Using Conditional Restricted Boltzmann Machine

Improving the Performance of HMM-based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format.

Voice Conversion with Smoothed GMM and MAP Adaptation

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data

Voice conversion using dynamic inter-frame features