Abstract: We adapt the architectures of previous audio manipulation and generation neural networks to the task of real-time any-to-one voice conversion. Our resulting model, LLVC ($\textbf{L}$ow-latency $\textbf{L}$ow-resource $\textbf{V}$oice $\textbf{C}$onversion), has a latency of under 20ms at a bitrate of 16kHz and runs nearly 2.8x faster than real-time on a consumer CPU. LLVC uses both a generative adversarial architecture as well as knowledge distillation in order to attain this performance. To our knowledge LLVC achieves both the lowest resource usage as well as the lowest latency of any open-source voice conversion model. We provide open-source samples, code, and pretrained model weights at https://github.com/KoeAI/LLVC.

What problem does this paper attempt to address?

The paper aims to achieve real-time voice conversion with low latency and low resource requirements. Specifically, the authors develop an any-to-one voice conversion model that runs on a consumer-grade CPU with a latency of under 20 ms. The model has to convert speech quickly without sacrificing output quality, and it has to work on streaming audio: it must process continuous input as it arrives without adding significant delay.

### Main contributions of the paper

1. **Low latency and low resource consumption**: LLVC achieves a latency of under 20 ms on 16 kHz audio and runs nearly 2.8x faster than real time on a consumer-grade CPU.
2. **Combination of generative adversarial networks and knowledge distillation**: By combining a generative adversarial network (GAN) architecture with knowledge distillation, LLVC reduces its demand for computational resources while maintaining high-quality output.
3. **Open-source implementation**: The authors provide open-source code, pretrained model weights, and samples, so other researchers and developers can reproduce and improve on the model.

### Specific problem description

- **Voice conversion task**: Convert the voice of any speaker into the voice of a specific target speaker while retaining the content and intonation of the original speech.
- **Real-time challenge**: The network must process audio faster than real time, keep latency low, and depend as little as possible on future audio context.
- **Application in low-resource environments**: To suit a wide range of consumer devices (such as laptops and smartphones), the model must perform well in low-resource computing environments.

To the authors' knowledge, LLVC has both the lowest resource usage and the lowest latency of any open-source voice conversion model.
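To make the headline figures concrete, the following minimal sketch works through the arithmetic behind "under 20 ms at 16 kHz" and "2.8x faster than real time". The function names and the 20 ms chunk length are illustrative assumptions, not taken from the LLVC code, and the sketch does not claim to reproduce how the authors measure latency.

```python
SAMPLE_RATE_HZ = 16_000  # rate quoted in the abstract


def samples_for_duration(duration_ms: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples spanned by a duration in milliseconds."""
    return int(sample_rate_hz * duration_ms / 1000)


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Speed relative to real time; values above 1 mean faster than real time."""
    return audio_seconds / processing_seconds


def processing_time_ms(chunk_ms: float, speedup: float) -> float:
    """Compute time needed for one chunk at a given speedup over real time."""
    return chunk_ms / speedup


if __name__ == "__main__":
    # Hypothetical 20 ms chunk, chosen only to match the latency bound in the abstract.
    chunk_ms = 20.0
    print(samples_for_duration(chunk_ms))               # samples in one chunk
    print(real_time_factor(28.0, 10.0))                 # 28 s of audio converted in 10 s
    print(round(processing_time_ms(chunk_ms, 2.8), 2))  # compute time per chunk, in ms
```

At a speedup of 2.8x, each chunk takes well under its own duration to process, which is the condition a streaming model needs to keep up with incoming audio. End-to-end latency in a real system also depends on buffering and on any lookahead the model uses.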

Low-latency Real-time Voice Conversion on CPU

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

A Study on Low-Latency Recognition-Synthesis-Based Any-to-One Voice Conversion

StreamVC: Real-Time Low-Latency Voice Conversion

Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion

Non-autoregressive real-time Accent Conversion model with voice cloning

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Voice Conversion With Just Nearest Neighbors

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution

How far are we from robust voice conversion: a survey

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Iteratively Improving Speech Recognition and Voice Conversion

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks

Adversarial Post-Processing of Voice Conversion Against Spoofing Detection

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers.

A Multidomain Generative Adversarial Network for Hoarse-to-Normal Voice Conversion

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion