VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning for Voice Generation
Rongjie Huang, Yongqi Wang, Ruofan Hu, Xiaoshan Xu, Zhiqing Hong, Dongchao Yang, Xize Cheng, Zehan Wang, Ziyue Jiang, Zhenhui Ye, Luping Liu, Siqi Zheng, Zhou Zhao
DOI: https://doi.org/10.1145/3664647.3681695
2024-01-01
Abstract: Voice large language models (LLMs) cast voice synthesis as a language modeling task in a discrete space, and have demonstrated significant progress to date. Despite this success, the development of voice LLMs for low-resource applications is hampered by data scarcity and high computational cost. In this work, we propose VoiceTuner, a self-supervised pre-training and efficient fine-tuning approach for low-resource voice generation. Specifically, 1) to mitigate data scarcity, we leverage a large-scale unlabeled dataset and pre-train VoiceTuner-SSL without pre-defined applications, so that it can be fine-tuned on downstream tasks; 2) to further reduce the high training cost of full fine-tuning, we introduce a multiscale transformer adapter, a plug-and-play module that updates only around 1% of the parameters. Experimental results demonstrate that VoiceTuner-SSL produces strong acoustic continuations, and that VoiceTuner achieves state-of-the-art results in rich-resource TTS evaluation compared with competitive baseline models. In low-resource (1h, 10h, 30h) downstream applications, including zero-shot TTS, instruction TTS, and singing voice synthesis, VoiceTuner shows superior audio quality and style similarity with reduced data requirements and computational cost. Audio samples are available at https://VoiceTuner.github.io
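
To give a rough sense of how an adapter can leave only about 1% of the parameters trainable, the sketch below counts parameters for a frozen transformer backbone with a small trainable module added to each layer. It is only an illustration under stated assumptions: it uses a generic bottleneck adapter (a down-projection followed by an up-projection), not the multiscale transformer adapter proposed in the paper, and the sizes `d_model`, `d_ff`, and `r` are hypothetical values that do not come from the paper.

```python
# Illustrative parameter-count arithmetic; NOT the paper's architecture.
# Assumption: a standard transformer layer (attention + MLP) is frozen, and
# a generic bottleneck adapter (down-projection + up-projection) is trained.


def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one transformer layer, ignoring LayerNorm."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, output projections
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    return attention + mlp


def bottleneck_adapter_params(d_model: int, r: int) -> int:
    """Weights and biases of a down-projection to r dims plus an up-projection."""
    down = d_model * r + r
    up = r * d_model + d_model
    return down + up


def trainable_fraction(n_layers: int, d_model: int, d_ff: int, r: int) -> float:
    """Share of all parameters that are trainable when only adapters are updated."""
    frozen = n_layers * transformer_layer_params(d_model, d_ff)
    trainable = n_layers * bottleneck_adapter_params(d_model, r)
    return trainable / (frozen + trainable)


# Hypothetical sizes chosen for illustration only.
frac = trainable_fraction(n_layers=24, d_model=1024, d_ff=4096, r=64)
print(f"trainable share: {frac:.2%}")
```

With these hypothetical sizes the trainable share comes out close to 1%. Because one adapter is added per layer, the ratio does not depend on the number of layers in this simplified count; it is governed by the bottleneck width `r` relative to `d_model` and `d_ff`.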