HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Sang-Hoon Lee, Ha-Yeong Choi, Hyung-Seok Oh, Seong-Whan Lee
2023-07-30
Abstract: Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/.
Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses a gap in zero-shot voice style transfer (VST): despite rapid progress in the field, current zero-shot systems still cannot effectively transfer the voice style of a new speaker. To solve this problem, the authors propose HierVST, a hierarchical adaptive end-to-end zero-shot voice style transfer model.

### Main Issues

1. **Voice Style Transfer for New Speakers**: Existing zero-shot VST systems perform poorly on new speakers and cannot effectively transfer their voice styles.
2. **Data Dependency**: Many existing VST models require paired text-audio datasets for training, which limits their application scope.
3. **Trade-off Between Audio Quality and Style Transfer**: Some methods improve audio quality at the expense of style transfer performance, and vice versa.

### Solutions

1. **Hierarchical Adaptive Generator (HAG)**: The generator produces the pitch representation first and the waveform audio second, which lets the model adapt to a new voice style progressively.
2. **Multi-Path Self-Supervised Speech Representation**: Self-supervised learning methods extract multiple representations, including linguistic and acoustic ones, from a single speech signal, which enhances the model's robustness and adaptability.
3. **Unconditional Generation**: Introducing unconditional generation improves the model's ability to generate acoustic representations, thereby enhancing speaker adaptability.
4. **Prosody Distillation**: Prosody distillation enhances the quality of the linguistic representations, further improving the model's performance.
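
The sequential order used by the hierarchical adaptive generator (pitch first, then waveform) can be illustrated with a toy sketch. This is not the authors' network: every function name, the 80-400 Hz pitch range, the hop size, and the scale-and-shift style conditioning are assumptions chosen only to make the two-stage flow concrete.

```python
import math


def adaptive_scale(frames, style):
    """Toy style conditioning (assumed): scale and shift each frame by style-derived values."""
    gain = 1.0 + sum(style) / len(style)
    bias = style[0]
    return [gain * x + bias for x in frames]


def generate_pitch(acoustic, style):
    """Stage 1: map the acoustic representation and style to a frame-level pitch contour in Hz."""
    conditioned = adaptive_scale(acoustic, style)
    # A sigmoid squashes each value into an assumed 80-400 Hz range.
    return [80.0 + 320.0 / (1.0 + math.exp(-x)) for x in conditioned]


def generate_waveform(acoustic, pitch, style, sample_rate=16000, hop=160):
    """Stage 2: synthesize samples conditioned on the stage-1 pitch and the same style."""
    amplitude = adaptive_scale([abs(a) for a in acoustic], style)
    wave = []
    phase = 0.0
    for f0, amp in zip(pitch, amplitude):
        for _ in range(hop):
            # Advance the oscillator phase according to the frame's pitch.
            phase += 2.0 * math.pi * f0 / sample_rate
            wave.append(math.tanh(amp) * math.sin(phase))
    return wave


def convert(acoustic, style):
    """Run both stages in order: the waveform stage consumes the pitch stage's output."""
    pitch = generate_pitch(acoustic, style)
    wave = generate_waveform(acoustic, pitch, style)
    return pitch, wave


if __name__ == "__main__":
    pitch, wave = convert([0.2, -0.1, 0.4, 0.0], style=[0.1, 0.2])
    print(len(pitch), len(wave))
```

In this sketch the style vector is applied at both stages, so changing it alters the pitch contour and, through it, the waveform. In the actual model these stages are learned neural networks rather than fixed formulas.
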
### Experimental Results

Experimental results show that HierVST significantly outperforms other VST models in zero-shot VST scenarios, excelling in both audio quality and speaker similarity. The model also performs well in many-to-many VST tasks.

### Conclusion

HierVST, through its hierarchical adaptive structure, successfully addresses the shortcomings of existing zero-shot VST systems in transferring voice styles of new speakers, providing new directions for future research in voice style transfer.