RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Anders R. Bargum, Simon Lajboschitz, Cumhur Erkut
2024-08-29
Abstract: Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, they rarely focus on model simplicity, high-sampling-rate environments or stream-ability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate the realm of voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses several challenges in voice conversion (VC):

1. **Efficient voice conversion at high sampling rates**: Current voice conversion models can produce high-quality output, but they rarely target high-sampling-rate settings (such as 48 kHz) and are difficult to run in real time. The paper proposes a voice conversion method that operates efficiently at high sampling rates.
2. **A simpler model structure for real-time processing**: Existing voice conversion models are usually complex and hard to apply in real-time scenarios. By simplifying the model structure and reducing inference time, the paper aims for low-latency, real-time voice conversion suited to applications such as games and digital audio workstations (DAWs).
3. **Decoupling content and speaker information**: To improve the quality and controllability of conversion, the paper proposes a method that separates the content information of the input voice from its speaker information. This gives more precise control over the features of the converted voice and brings it closer to the target speaker's voice.
4. **Zero-shot voice conversion**: Zero-shot voice conversion means generating realistic speech for target speakers that were not seen during training. The paper explores how to improve generalization to unseen speakers while maintaining high-quality output.

### Main contributions

1. **Removing variational inference and introducing a conditional autoencoder**: The variational inference component of the original RAVE model is removed and replaced with a conditional autoencoder based on Feature-wise Linear Modulation (FiLM), trained with a multi-resolution STFT loss and three different discriminators.
2. **Injecting speaker information from a pre-trained speaker embedding network**: Speaker embeddings are extracted with the Fast ResNet-34 speaker encoder and supplied to the model as external conditioning, which helps the encoder separate content from speaker information.
3. **Information perturbation and response-based knowledge distillation**: Both techniques are introduced to help the encoder capture linguistically relevant soft speech units, improving the quality of the content representation.

With these changes, the proposed method operates efficiently at high sampling rates and significantly reduces inference time, while maintaining naturalness, quality, and intelligibility comparable to a state-of-the-art solution for seen speakers.
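To make the FiLM-based conditioning mentioned above more concrete, here is a minimal sketch in plain Python. It shows only the general FiLM idea: a conditioning vector (here a speaker embedding) is mapped to a per-channel scale `gamma` and shift `beta`, which then modulate the feature channels. The function names, the tiny layer sizes, and the hand-picked weights are all assumptions for illustration; the summary does not say where FiLM layers sit inside the model or how large they are, and in practice the weights would be learned.

```python
def linear(weights, bias, vec):
    """Affine map: `weights` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM to `features` (a list of channels, each a list of time steps).

    Each channel c is transformed as gamma[c] * x + beta[c], where gamma and
    beta are predicted from the speaker embedding.
    """
    gamma = linear(w_gamma, b_gamma, speaker_emb)  # per-channel scale
    beta = linear(w_beta, b_beta, speaker_emb)     # per-channel shift
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical toy values: 2 feature channels, 3 time steps, 2-dim embedding.
features = [[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]]
speaker_emb = [0.5, -1.0]
w_gamma, b_gamma = [[2.0, 0.0], [0.0, 1.0]], [1.0, 1.0]  # gamma = [2.0, 0.0]
w_beta, b_beta = [[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]    # beta = [-1.0, 0.5]

converted = film(features, speaker_emb, w_gamma, b_gamma, w_beta, b_beta)
print(converted)  # [[1.0, 3.0, 5.0], [0.5, 0.5, 0.5]]
```

Because `gamma` and `beta` depend only on the speaker embedding, swapping in a different embedding changes how every channel is scaled and shifted without touching the content features themselves, which is why this kind of conditioning suits the content/speaker split described above.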