Low-latency Real-time Voice Conversion on CPU

Konstantine Sadov, Matthew Hutter, Asara Near
2023-11-02
Abstract: We adapt the architectures of previous audio manipulation and generation neural networks to the task of real-time any-to-one voice conversion. Our resulting model, LLVC (**L**ow-latency **L**ow-resource **V**oice **C**onversion), has a latency of under 20ms at a bitrate of 16kHz and runs nearly 2.8x faster than real-time on a consumer CPU. LLVC uses both a generative adversarial architecture as well as knowledge distillation in order to attain this performance. To our knowledge LLVC achieves both the lowest resource usage as well as the lowest latency of any open-source voice conversion model. We provide open-source samples, code, and pretrained model weights at https://github.com/KoeAI/LLVC.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to perform real-time voice conversion with low latency and low resource requirements. Specifically, the authors set out to build an any-to-one voice conversion model that runs on consumer-grade CPUs with very low latency (under 20 milliseconds) and faster-than-real-time throughput. The model must stay fast while preserving output quality, and it must work in a streaming setting: it has to process continuous audio input in real time without introducing significant latency.

### Main contributions of the paper

1. **Low latency and low resource consumption**: The LLVC model achieves a latency of less than 20 milliseconds and runs nearly 2.8 times faster than real time on a consumer-grade CPU.
2. **Combination of generative adversarial networks and knowledge distillation**: By combining a generative adversarial network (GAN) architecture with knowledge distillation, LLVC reduces its demand for computational resources while maintaining high-quality output.
3. **Open-source implementation**: The authors provide open-source code, pretrained model weights, and samples, enabling other researchers and developers to reproduce and further improve the model.

### Specific problem description

- **Voice conversion task**: Convert the voice of any speaker into the style of a specific target speaker while retaining the content and intonation of the original speech.
- **Real-time challenge**: The network must process audio faster than real time, keep latency low, and depend as little as possible on future audio context.
- **Application in low-resource environments**: To make the model suitable for a wide range of consumer devices (such as laptops and smartphones), its performance in low-resource computing environments must be optimized.
As a result, to the authors' knowledge, LLVC has both the lowest resource usage and the lowest latency of any open-source voice conversion model.
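
The latency and speed figures above can be related with some simple arithmetic. The following is a minimal sketch, not the authors' code, and it rests on several assumptions: it reads the 16kHz figure in the abstract as the audio sampling rate, it models end-to-end latency as buffering time (chunk plus lookahead) plus compute time, and the chunk and lookahead sizes in the usage section are hypothetical values chosen for illustration, not taken from the paper. All function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: the 16kHz in the abstract is the sampling rate


def samples_to_ms(num_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration in milliseconds of `num_samples` audio samples."""
    return 1000.0 * num_samples / sample_rate_hz


def buffering_latency_ms(chunk_samples: int, lookahead_samples: int,
                         sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Time spent waiting for one chunk plus its future context to arrive."""
    return samples_to_ms(chunk_samples + lookahead_samples, sample_rate_hz)


def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """Ratio of audio duration to processing time; values above 1 are faster than real time."""
    return audio_seconds / processing_seconds


def end_to_end_latency_ms(chunk_samples: int, lookahead_samples: int, speedup: float,
                          sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Assumed latency model: buffering time plus the time needed to process one chunk."""
    compute_ms = samples_to_ms(chunk_samples, sample_rate_hz) / speedup
    return buffering_latency_ms(chunk_samples, lookahead_samples, sample_rate_hz) + compute_ms


if __name__ == "__main__":
    # Hypothetical streaming configuration (not from the paper):
    # 10 ms chunks with 2 ms of lookahead at 16 kHz.
    chunk, lookahead = 160, 32
    speedup = 2.8  # "nearly 2.8x faster than real-time", per the abstract
    total = end_to_end_latency_ms(chunk, lookahead, speedup)
    print(f"buffering: {buffering_latency_ms(chunk, lookahead):.1f} ms")
    print(f"end-to-end: {total:.2f} ms (budget: 20 ms)")
```

Under this model, shrinking the chunk or the lookahead lowers buffering latency, while a higher speedup lowers the compute term. This is one way to see why the summary treats both fast inference and minimal future context as requirements for real-time use.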