Abstract: Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or stream-ability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate the realm of voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.

What problem does this paper attempt to address?

This paper addresses several key challenges in voice conversion (VC):

1. **Efficient voice conversion at high sampling rates**: Current voice conversion models can produce high-quality output, but most perform poorly at high sampling rates (such as 48 kHz) and struggle to run in real time. The paper proposes a voice conversion method that operates efficiently at high sampling rates.
2. **Simplifying the model structure to achieve real-time processing**: Existing voice conversion models are usually complex and hard to apply in real-time scenarios. By simplifying the model structure and reducing inference time, the proposed model can perform real-time voice conversion with low latency, which suits applications such as games and digital audio workstations (DAWs).
3. **Decoupling content and speaker information**: To improve the quality and controllability of conversion, the paper separates the content information of the input voice from its speaker information. This allows more precise control over the converted voice and brings it closer to the target speaker's voice.
4. **Zero-shot voice conversion**: Zero-shot voice conversion is the ability to generate realistic voices for target speakers that were not seen during training. The paper explores how to improve generalization to unseen speakers while maintaining high-quality output.

### Main contributions

1. **Removing variational inference and introducing a conditional autoencoder**: The variational inference part of the original RAVE model is removed and replaced with a conditional autoencoder based on Feature-wise Linear Modulation (FiLM), trained with a multi-resolution STFT loss and three different discriminators.
2. **Using a pre-trained speaker embedding network to inject speaker information**: Speaker embeddings are extracted with the Fast ResNet-34 speaker encoder and supplied as external conditioning, so that the model's encoder can focus on content while speaker identity comes from the embedding.
3. **Information perturbation and response-based knowledge distillation**: Information perturbation and response-based knowledge distillation are introduced to help the encoder capture linguistically relevant soft speech units, thereby improving the quality of the content representation.

With these changes, the proposed method operates efficiently at high sampling rates and significantly reduces inference time, while reaching naturalness, quality, and intelligibility comparable to a state-of-the-art solution for seen speakers. Similarity to unseen speakers remains a challenge.
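To make the FiLM conditioning mentioned in the contributions more concrete, here is a minimal sketch in plain Python. It shows only the general idea of Feature-wise Linear Modulation: a speaker embedding is projected to a per-channel scale `gamma` and shift `beta`, which are then applied to every time frame of a feature map. The function names (`linear`, `film`), the projection weights, and the toy sizes are assumptions made for illustration; they are not taken from the paper, and the actual model's layers and dimensions may differ.

```python
def linear(weights, bias, vec):
    """Affine map: `weights` is (out x in), `vec` has length `in`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, speaker_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation (illustrative, hypothetical layout).

    features: list of channels, each a list of time frames (channels x time).
    Returns gamma[c] * features[c][t] + beta[c] for each channel c, frame t.
    """
    # Project the speaker embedding to one scale and one shift per channel.
    gamma = linear(w_gamma, b_gamma, speaker_embedding)
    beta = linear(w_beta, b_beta, speaker_embedding)
    # The same (gamma, beta) pair modulates every frame of a given channel.
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy example: 2 channels, 2 frames, 2-dimensional speaker embedding.
# In the paper the embedding comes from a pre-trained speaker encoder;
# here it is a hand-written stand-in.
features = [[1.0, 2.0], [3.0, 4.0]]
speaker = [1.0, -1.0]
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # gamma = [1, -1]
w_beta, b_beta = [[0.0, 0.0], [0.0, 0.0]], [0.5, 0.5]    # beta = [0.5, 0.5]

modulated = film(features, speaker, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[1.5, 2.5], [-2.5, -3.5]]
```

Because `gamma` and `beta` depend only on the speaker embedding, swapping in a different embedding changes how the content features are scaled and shifted without touching the content features themselves, which is the intuition behind conditioning on external speaker information.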

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates
