Abstract: Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at https://tcsinger.github.io/.
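
To make the idea behind the clustering style encoder more concrete, the following is a minimal sketch of the nearest-codebook lookup at the core of vector quantization in general. It is not the paper's implementation: the function names, the toy codebook, and the 2-D latent size are all assumptions made for illustration, and the clustering-specific codebook updates that a CVQ model would add are omitted.

```python
# Illustrative sketch only: plain nearest-neighbour vector quantization.
# All names and values here are hypothetical, not taken from TCSinger.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames, codebook):
    """Map each continuous style frame to a discrete index and its code vector."""
    indices = [nearest_code(frame, codebook) for frame in frames]
    return indices, [codebook[i] for i in indices]


# Toy data (assumed): 3 codes in a 2-D latent space, 4 style frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.4]]

indices, quantized = quantize(frames, codebook)
print(indices)  # [0, 1, 2, 0]
```

The discrete indices are what makes the latent space "compact": a downstream model can predict a short sequence of code indices rather than continuous style vectors.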

What problem does this paper attempt to address?

The paper targets the difficulty that existing Singing Voice Synthesis (SVS) models have in generating high-quality singing voices with diverse styles in zero-shot scenarios. Specifically, it addresses two main issues:

1. **Multi-aspect singing style modeling and control**:
   - Singing styles span multiple aspects: singing method (e.g., bel canto), emotion (e.g., happy or sad), rhythm (including the handling of notes and transitions), technique (e.g., falsetto), and pronunciation (e.g., enunciation). This diversity makes effective modeling, transfer, and control very challenging.
   - Existing SVS models usually capture only limited style information, ignore other important style features, and cannot perform multi-level style control.
2. **Generating singing voices rich in style details for unseen singers**:
   - Existing models perform poorly on unseen singers and have difficulty generating singing voices rich in style details. This is mainly because they usually assume that the target singer is seen during training, so performance declines in zero-shot tasks.

To address these challenges, the paper proposes TCSinger, a zero-shot SVS model for style transfer across cross-lingual speech and singing styles, with multi-level style control. TCSinger relies on three key modules:

- **Clustering Style Encoder**: Uses a Clustering Vector Quantization (CVQ) model to stably compress style information into a compact latent space, which facilitates subsequent prediction and improves training stability and reconstruction quality.
- **Style and Duration Language Model (S&D-LM)**: Combines audio and text prompts to predict style information and phoneme durations simultaneously, which improves style transfer and control.
- **Style Adaptive Decoder**: Adopts a novel mel-style adaptive normalization method to generate richly detailed singing voices, making the output more natural and diverse.

Experimental results show that TCSinger outperforms existing baseline models on a variety of tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.
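
As a rough illustration of what "adaptive normalization" means in a decoder like the Style Adaptive Decoder, the sketch below normalizes one mel frame and then applies a style-dependent scale and shift. This is the generic conditional-normalization pattern, not the paper's mel-style adaptive normalization: the function name, the toy frame, and the scale/shift values are assumptions, and in a real model the scale and shift would be predicted from style features by learned layers rather than passed in by hand.

```python
import math

# Illustrative sketch only: generic style-conditioned normalization.
# Names and values are hypothetical and do not come from TCSinger.

def style_adaptive_norm(frame, scale, shift, eps=1e-5):
    """Normalize one frame to zero mean and unit variance, then re-style it.

    `scale` and `shift` stand in for per-bin parameters that a style
    network would predict; here they are supplied directly (assumption).
    """
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [s * (x - mean) * inv_std + b
            for x, s, b in zip(frame, scale, shift)]


# Toy mel frame with 4 bins and hand-picked style parameters (assumed).
mel_frame = [1.0, 2.0, 3.0, 4.0]
scale = [2.0, 2.0, 2.0, 2.0]
shift = [0.5, 0.5, 0.5, 0.5]

styled = style_adaptive_norm(mel_frame, scale, shift)
print(styled)
```

Because the normalization step removes the frame's own mean and variance, the output statistics are set entirely by the style-dependent `scale` and `shift`, which is why this family of methods is a common way to inject style into a decoder.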

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

UniSinger: Unified End-to-End Singing Voice Synthesis with Cross-Modality Information Matching

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement

SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

BiSinger: Bilingual Singing Voice Synthesis

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

RealSinger: Ultra-realistic singing voice generation via stochastic differential equations

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis