SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement

Zihao Wang, Le Ma, Yongsheng Feng, Xin Pan, Yuhang Jin, Kejun Zhang
2024-10-17
Abstract: Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot due to incomplete feature disentanglement or dependence on the speaker look-up table. We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and non-human timbre. SaMoye disentangles the singing voice's features into content, timbre, and pitch features, where we combine multiple ASR models and compress the content features to reduce timbre leaks. Besides, we enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with top-3 similar speakers. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance, which comprises more than 1,815 hours of pure singing voice and 6,367 speakers. We conduct objective and subjective experiments to find that SaMoye outperforms other models in zero-shot SVC tasks even under extreme conditions like converting singing to animals' timbre. The weights, code, dataset, and documents of SaMoye are publicly available at https://github.com/CarlWangChina/SaMoye-SVC.
Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the challenges of **zero-shot singing voice conversion (SVC)**. Existing SVC methods perform poorly on unseen singers or voice characteristics, for three main reasons:

1. **Incomplete feature disentanglement**: Existing methods cannot fully separate the content, timbre, and pitch features of the audio, so the converted audio retains timbre from the source singer.
2. **Dependence on speaker look-up tables**: Some methods obtain timbre features from a speaker look-up table, which prevents the model from extending to unseen speakers.
3. **Limited data volume**: Existing SVC datasets are relatively small, and parallel data (multiple singers singing the same song) is especially scarce, which limits the model's ability to generalize.

To address these problems, the authors propose a high-quality zero-shot SVC model named **SaMoye** and build a large-scale SVC dataset. SaMoye improves on existing methods in the following ways:

- **Feature disentanglement**: Decompose the singing voice into content, timbre, and pitch features, combining multiple automatic speech recognition (ASR) models and compressing the content features to reduce timbre leakage.
- **Timbre feature enhancement**: Enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with those of the top-3 most similar speakers.
- **Large-scale dataset**: Build a dataset of more than 1,815 hours of pure singing voice from 6,367 speakers to support zero-shot performance.

The authors also verify through objective and subjective experiments that SaMoye outperforms other models in zero-shot SVC tasks, including under extreme conditions such as converting singing voices into animal timbres.

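
The timbre enhancement step, mixing a speaker embedding with those of its top-3 most similar speakers, can be illustrated with a minimal sketch. The summary does not say how similarity is measured or how the embeddings are weighted, so cosine similarity, a plain average of the neighbors, and the blend weight `alpha` are all assumptions made here for illustration; the function names are hypothetical and are not taken from the SaMoye code base.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mix_with_top_k(target, bank, k=3, alpha=0.5):
    """Blend `target` with the mean of its k most similar embeddings in `bank`.

    `alpha` is the weight kept on the target embedding. Both `alpha` and the
    unweighted mean over neighbors are illustrative assumptions, not values
    reported for SaMoye.
    """
    # Rank the bank by similarity to the target and keep the k closest entries.
    neighbors = sorted(bank, key=lambda e: cosine(target, e), reverse=True)[:k]
    dim = len(target)
    # Average the selected neighbors dimension by dimension.
    mean = [sum(e[i] for e in neighbors) / len(neighbors) for i in range(dim)]
    # Convex combination of the target and the neighbor mean.
    return [alpha * target[i] + (1 - alpha) * mean[i] for i in range(dim)]


# Toy 2-dimensional "speaker embeddings"; real embeddings have many more dimensions.
target = [1.0, 0.0]
bank = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [-1.0, 0.0]]
mixed = mix_with_top_k(target, bank, k=3, alpha=0.5)
```

In this toy example the dissimilar embedding `[-1.0, 0.0]` is excluded from the mix, so the result stays close to the target.
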
### Summary

The main contributions of this paper are:

- Proposing the first open-source high-quality zero-shot SVC model, SaMoye, which can convert singing voices into human or even non-human timbres.
- Evaluating different combinations of content features and their compression methods (such as k-means and vector quantization) to reduce timbre leakage.
- Verifying the high performance of SaMoye in human and non-human timbre conversion through objective and subjective experiments.
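
Both compression methods mentioned above, k-means and vector quantization, share one core operation once a codebook exists: each frame of content features is replaced by its nearest codebook entry. This discards fine detail, which is where residual timbre information can hide. The sketch below shows only that lookup step, assuming the codebook has already been learned. The names `quantize` and `squared_distance` and the toy values are hypothetical, and the summary does not state the codebook size or feature dimension used in SaMoye.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each frame to the index and value of its nearest codebook entry."""
    indices = []
    quantized = []
    for frame in frames:
        # Pick the codebook index with the smallest distance to this frame.
        idx = min(range(len(codebook)),
                  key=lambda i: squared_distance(frame, codebook[i]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Toy codebook with two entries and three 2-dimensional content frames.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
indices, quantized = quantize(frames, codebook)
```

A smaller codebook compresses more aggressively. The trade-off between removing timbre and preserving content is the kind of choice the paper's evaluation of compression methods is concerned with.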