NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion
Danyang Cao, Zeyi Zhang, Jinyuan Zhang
DOI: https://doi.org/10.1109/lsp.2024.3439469
2024-08-21
IEEE Signal Processing Letters
Abstract: With the advancement of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) technologies, high-quality speech conversion can now be achieved by extracting the content of the source speech and reconstructing the waveform from target speaker information. However, current methods are still too slow at inference, especially on slower CPUs, which greatly limits real-time applications of speech conversion. To address this issue, we propose a real-time speech conversion model called NeuralVC. Our model is based on the VITS architecture, in which the decoder that synthesizes speech is a key factor in synthesis speed. To obtain speaker-independent content information, we use a pre-trained HuBERT model to extract speech content features. To improve synthesis speed, we integrate a lightweight neural decoder based on SEANet and modify it to accept speaker information as conditioning, which significantly improves the speaker similarity of the converted speech. Additionally, we introduce a pre-trained speaker encoder and combine it with a speaker consistency loss to improve the model's conversion ability for unseen speakers, achieving any-to-any speech conversion. Experimental results demonstrate that the proposed model achieves high-quality real-time speech conversion and maintains good performance in unseen scenarios.
engineering, electrical & electronic
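The abstract mentions a speaker consistency loss but does not give its formula. The sketch below shows one common formulation, assumed here for illustration only: the negative mean cosine similarity between speaker embeddings of the target speech and of the converted speech. The function names, the `alpha` weight, and the plain-list embedding format are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(target_embs, converted_embs, alpha=1.0):
    """Assumed form: -alpha * mean cosine similarity over the batch.

    target_embs:    speaker embeddings of the target (reference) speech
    converted_embs: speaker embeddings of the converted speech
    alpha:          hypothetical weighting factor for the loss term
    """
    if not target_embs or len(target_embs) != len(converted_embs):
        raise ValueError("batches must be non-empty and the same size")
    sims = [cosine_similarity(t, c) for t, c in zip(target_embs, converted_embs)]
    return -alpha * sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings; a real speaker encoder would produce much larger vectors.
    target = [[1.0, 0.0], [0.0, 1.0]]
    converted = [[0.9, 0.1], [0.2, 0.8]]
    print(speaker_consistency_loss(target, converted))
```

Under this assumed form the loss decreases as the converted speech's embedding aligns with the target speaker's embedding, reaching its minimum of `-alpha` when every pair points in the same direction.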