Few-Shot Custom Speech Synthesis with Multi-Angle Fusion

Haifeng Zhao, Hao Liu, Mingwei Cao, Yi Liu
DOI: https://doi.org/10.1109/icsp58490.2023.10248799
2023-01-01
Abstract: In this paper, we propose TDNN-VITS, an efficient custom speech synthesis system that can synthesize the speech of arbitrary target speakers, improving speech quality while reducing the amount of adaptation data required. Our model consists of a speaker encoding module, which extracts speaker timbre information, and a TTS module, which is based on the fully end-to-end VITS model but modified to address the problems specific to custom speech. To further improve speech quality and speaker similarity, we propose a multi-angle fusion speaker embedding approach. With only 10 speech samples, about one minute of audio in total, good results can be achieved after just half an hour of fine-tuning. Experimental results show that our model improves on the previous classical model and achieves good speech naturalness and speaker similarity. Ablation experiments further demonstrate the effectiveness of the multi-angle fusion.