Abstract: Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.

What problem does this paper attempt to address?

The paper addresses the inability of existing zero-shot voice style transfer (VST) systems to transfer the voice style of a new, previously unseen speaker effectively, despite rapid recent progress in the field. To solve this problem, the authors propose HierVST, a hierarchical adaptive end-to-end zero-shot voice style transfer model.

### Main Issues

1. **Voice Style Transfer for New Speakers**: Existing zero-shot VST systems perform poorly on new speakers and cannot transfer their voice styles effectively.
2. **Data Dependency**: Many existing VST models require paired text-audio datasets for training, which limits their scope of application.
3. **Trade-off Between Audio Quality and Style Transfer**: Some methods improve audio quality at the expense of style transfer performance, and vice versa.

### Solutions

1. **Hierarchical Adaptive Generator (HAG)**: The generator produces the pitch representation first and the waveform audio second, which lets the model adapt progressively to a new voice style.
2. **Multi-Path Self-Supervised Speech Representation**: Self-supervised learning methods extract multiple representations, including linguistic and acoustic ones, from a single utterance to improve the model's robustness and adaptability.
3. **Unconditional Generation**: Introducing unconditional generation improves the model's capacity to generate acoustic representations, which in turn improves speaker adaptation.
4. **Prosody Distillation**: Prosody distillation enhances the quality of the linguistic representations, further improving the model's performance.
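
The sequential "pitch first, waveform second" idea behind the hierarchical adaptive generator can be illustrated with a toy sketch. This is not the authors' implementation: the `Style` class, the function names, and all parameters are hypothetical, and a simple sine oscillator stands in for the neural networks used in the paper. It only shows the shape of the data flow, in which both stages are conditioned on the same target style and the second stage consumes the output of the first.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Style:
    """Hypothetical stand-in for a style embedding taken from a reference utterance."""
    mean_f0_hz: float  # typical pitch of the target speaker
    f0_range: float    # relative amount of pitch variation
    gain: float        # overall loudness


def generate_pitch(content: List[float], style: Style) -> List[float]:
    """Stage 1: map a per-frame content contour (values in [-1, 1]) to F0 in Hz,
    conditioned on the target style."""
    return [style.mean_f0_hz * (1.0 + style.f0_range * c) for c in content]


def generate_waveform(f0: List[float], style: Style,
                      sample_rate: int = 16000, hop: int = 160) -> List[float]:
    """Stage 2: render audio from the stage-1 pitch, again conditioned on style.
    A sine oscillator replaces the neural waveform generator (assumption)."""
    phase = 0.0
    samples: List[float] = []
    for hz in f0:
        for _ in range(hop):  # each frame covers `hop` audio samples
            phase += 2.0 * math.pi * hz / sample_rate
            samples.append(style.gain * math.sin(phase))
    return samples


def convert(content: List[float], style: Style) -> List[float]:
    """Run both stages in order: pitch first, then waveform."""
    f0 = generate_pitch(content, style)
    return generate_waveform(f0, style)
```

Calling `convert([0.0, 1.0, -1.0], Style(mean_f0_hz=200.0, f0_range=0.1, gain=0.5))` yields three frames of audio whose pitch follows the style-adapted contour. Swapping in a different `Style` changes both stages without touching the content contour, which is the property the hierarchical design is meant to provide.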
### Experimental Results

Experimental results show that HierVST outperforms other VST models in zero-shot VST scenarios in both audio quality and speaker similarity. The model also performs well in many-to-many VST tasks.

### Conclusion

Through its hierarchical adaptive structure, HierVST addresses the shortcomings of existing zero-shot VST systems in transferring the voice styles of new speakers, and it suggests new directions for future research in voice style transfer.

HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction

Zero-shot Cross-lingual Voice Transfer for TTS

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts