TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

Yu Zhang, Ziyue Jiang, Ruiqi Li, Changhao Pan, Jinzheng He, Rongjie Huang, Chuxin Wang, Zhou Zhao
2024-10-03
Abstract: Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at https://tcsinger.github.io/.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
The paper addresses the difficulty that existing Singing Voice Synthesis (SVS) models have in generating high-quality, stylistically diverse singing voices in zero-shot scenarios. Specifically, it targets two issues:

1. **Multi-aspect singing style modeling and control**:
   - Singing styles span multiple aspects: singing method (e.g., bel canto), emotion (e.g., happy or sad), rhythm (including the handling of notes and transitions), technique (e.g., falsetto), and pronunciation (e.g., enunciation). This diversity makes effective modeling, transfer, and control very challenging.
   - Existing SVS models usually capture only limited style information, ignore other important style features, and cannot perform multi-level style control.
2. **Generating singing voices rich in style details for unseen singers**:
   - Existing models perform poorly on unseen singers and have difficulty generating singing voices rich in style details. This is mainly because these models usually assume that the target singer is known during training, so their performance declines in zero-shot tasks.

To address these challenges, the paper proposes TCSinger, a zero-shot singing voice synthesis model for style transfer across cross-lingual speech and singing styles, with multi-level style control. TCSinger relies on three key modules:

- **Clustering Style Encoder**: Uses a Clustering Vector Quantization (CVQ) model to stably compress style information into a compact latent space, which facilitates subsequent prediction and improves training stability and reconstruction quality.
- **Style and Duration Language Model (S&D-LM)**: Combines audio and text prompts to predict style information and phoneme durations simultaneously, which improves both style transfer and style control.
- **Style Adaptive Decoder**: Adopts a novel mel-style adaptive normalization method to generate richly detailed singing voices, making the output more natural and diverse.

Experimental results show that TCSinger outperforms the baseline models across a variety of tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.
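To make two of the ideas above more concrete, here is a minimal, generic sketch of (a) vector quantization, where a continuous style vector is snapped to its nearest codebook entry, and (b) adaptive normalization, where features are normalized and then rescaled by style-dependent parameters. This is not TCSinger's implementation: the summary does not specify how CVQ or mel-style adaptive normalization work internally, so the function names, the toy codebook, and the fixed mapping from a code to `gamma` and `beta` are all assumptions made for illustration. In a real model these would be learned.

```python
import math


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def adaptive_norm(features, gamma, beta, eps=1e-5):
    """Normalize `features` to zero mean and unit variance, then scale and shift."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    std = math.sqrt(var + eps)
    return [gamma * (x - mean) / std + beta for x in features]


# Toy codebook of three 2-D style codes (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]

# A continuous style vector, as an encoder might produce for one frame.
style_frame = [0.9, 1.2]
index, code = quantize(style_frame, codebook)

# Hypothetical fixed mapping from the selected code to scale/shift parameters;
# a trained decoder would predict these with learned layers instead.
gamma = 1.0 + 0.5 * code[0]
beta = 0.1 * code[1]

# One toy mel-spectrogram frame with four bins.
mel_frame = [1.0, 2.0, 3.0, 4.0]
styled = adaptive_norm(mel_frame, gamma, beta)

print("selected code index:", index)
print("styled frame:", [round(v, 3) for v in styled])
```

The quantization step shows why a discrete codebook gives a compact latent space: each style vector is represented by a single index. The normalization step shows how style information can modulate decoder features without changing their shape.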