Abstract: Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference from that speaker. Although this setting is more practical in real applications, current zero-shot methods still produce speech with unsatisfactory naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. The unique challenge of zero-shot speaker and style cloning is to learn disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone and cascades a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Leveraging signal perturbation, U-Style is thus explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve the modeling of unseen speakers and styles, these two encoders conduct multi-level speaker and style modeling with skip-connected U-nets, incorporating both representation extraction and information reconstruction. In addition, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses state-of-the-art methods in unseen speaker cloning in terms of naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
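
As a rough illustration of the two normalization steps named in the abstract, the sketch below shows one plausible reading of each in plain Python. It is not the authors' implementation: the abstract gives no formulas, so treating "mean-based instance normalization" as subtracting each channel's mean over time (without dividing by the standard deviation) is an assumption, and the function names `mean_instance_norm` and `style_adaptive_layer_norm` are hypothetical. In a real model the gain and bias would presumably be predicted from a style embedding by a learned layer; here they are simply passed in.

```python
import math


def mean_instance_norm(features):
    """Subtract each channel's mean over time (assumed reading of
    'mean-based instance normalization').

    features: list of frames, each frame a list of channel values.
    Returns the mean-removed features and the per-channel means, which
    could serve as an utterance-level representation.
    """
    n_frames = len(features)
    n_channels = len(features[0])
    means = [sum(frame[c] for frame in features) / n_frames
             for c in range(n_channels)]
    normalized = [[frame[c] - means[c] for c in range(n_channels)]
                  for frame in features]
    return normalized, means


def style_adaptive_layer_norm(frame, gain, bias, eps=1e-5):
    """Layer-normalize one frame across channels, then scale and shift it
    with a gain and bias that are assumed to come from a style embedding."""
    mu = sum(frame) / len(frame)
    var = sum((x - mu) ** 2 for x in frame) / len(frame)
    return [g * (x - mu) / math.sqrt(var + eps) + b
            for x, g, b in zip(frame, gain, bias)]


# Toy usage: two frames with two channels each.
normalized, means = mean_instance_norm([[1.0, 2.0], [3.0, 6.0]])
adapted = style_adaptive_layer_norm([1.0, 3.0], gain=[2.0, 2.0], bias=[1.0, 1.0])
```

Under this reading, the first function removes utterance-level statistics while extracting a representation, and the second injects the conditioning signal back into the normalized hidden features.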

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

Disentangling Style and Speaker Attributes for TTS Style Transfer

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis