CoMoSVC: Consistency Model-based Singing Voice Conversion

Yiwen Lu, Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo
2024-01-03
Abstract: Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference speed, and acceleration thus becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method, which aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first specially designed for SVC, and a student model is further distilled under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU reveal that although CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system, it still achieves comparable or superior conversion performance based on both subjective and objective metrics. Audio samples and codes are available at
Audio and Speech Processing, Artificial Intelligence, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses the slow inference of diffusion-based Singing Voice Conversion (SVC). Diffusion models can generate high-quality audio, but their sampling process requires many iterative steps, which makes inference slow. The paper introduces CoMoSVC, an SVC method based on consistency models that aims for both high-quality generation and high-speed sampling. The authors first design a diffusion-based teacher model tailored to SVC, then distill a student model from it under the self-consistency property so that sampling takes a single step. In the experiments, CoMoSVC maintains comparable or better conversion performance than state-of-the-art diffusion-based SVC systems while running much faster: approximately 500 times faster than SoVITS-SVC and 50 times faster than DiffSVC. CoMoSVC also shows improvements in quality and similarity.
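
The contrast between iterative teacher sampling and one-step student sampling can be illustrated with a small toy sketch. It is not the CoMoSVC implementation: it assumes one-dimensional Gaussian "data" so that the ideal denoiser and the consistency function both have closed forms, and it assumes an EDM-style noise schedule. The names denoiser, teacher_sample, and consistency_fn, and the constants MU, S, T_MAX, and T_MIN, are all hypothetical. In the paper the student is a neural network distilled from the teacher; here a closed-form mapping stands in for it.

```python
import numpy as np

# Toy setup (all values are illustrative assumptions, not from the paper).
MU, S = 2.0, 0.5              # "data" distribution N(MU, S^2)
T_MAX, T_MIN = 80.0, 0.002    # largest and smallest noise levels

def denoiser(x, t):
    """Ideal teacher denoiser E[x_0 | x_t] for Gaussian data, x_t = x_0 + t * noise."""
    return MU + (S**2 / (S**2 + t**2)) * (x - MU)

def teacher_sample(x_T, n_steps):
    """Iterative sampling: Euler steps on the ODE dx/dt = (x - D(x, t)) / t."""
    ts = np.geomspace(T_MAX, T_MIN, n_steps + 1)
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        slope = (x - denoiser(x, t_cur)) / t_cur   # one denoiser call per step
        x = x + (t_next - t_cur) * slope
    return x

def consistency_fn(x, t):
    """Closed-form stand-in for a distilled student.

    Maps any point (x, t) on an ODE trajectory to that trajectory's
    endpoint at T_MIN in a single evaluation.
    """
    scale = np.sqrt((S**2 + T_MIN**2) / (S**2 + t**2))
    return MU + scale * (x - MU)

rng = np.random.default_rng(0)
x_T = T_MAX * rng.standard_normal(4)           # start from pure noise

teacher_out = teacher_sample(x_T, n_steps=2000)  # 2000 denoiser calls
student_out = consistency_fn(x_T, T_MAX)         # 1 function call

# Self-consistency: a point further along the same trajectory (t = 1.0)
# is mapped to the same endpoint as the starting point.
t_mid = 1.0
x_mid = MU + np.sqrt((S**2 + t_mid**2) / (S**2 + T_MAX**2)) * (x_T - MU)
same_endpoint = np.allclose(consistency_fn(x_mid, t_mid), student_out)

print("teacher (2000 steps):", teacher_out)
print("student (1 step):    ", student_out)
print("self-consistent:     ", same_endpoint)
```

In this toy, the one-step output agrees closely with the many-step teacher output because both follow the same trajectory to its endpoint. The speedup reported for CoMoSVC comes from the same idea: the student replaces many sequential network evaluations with one.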