UniSinger: Unified End-to-End Singing Voice Synthesis with Cross-Modality Information Matching

Zhiqing Hong, Chenye Cui, Rongjie Huang, Lichao Zhang, Jinglin Liu, Jinzheng He, Zhou Zhao
DOI: https://doi.org/10.1145/3581783.3612150
2023-01-01
Abstract: Though previous works have shown remarkable achievements in singing voice generation, most existing models focus on one specific application, and unified singing voice synthesis models are lacking. In addition to low relevance among tasks, differing input modalities are among the most intractable obstacles. Current methods suffer from information confusion and cannot perform precise control. In this work, we propose UniSinger, a unified end-to-end singing voice synthesizer that integrates three abilities related to singing voice generation into a single framework: singing voice synthesis (SVS), singing voice conversion (SVC), and singing voice editing (SVE). Specifically, we perform representation disentanglement to control different attributes of the singing voice. We further propose a cross-modality information matching method to close the distribution gap between multi-modal inputs and achieve end-to-end training. Experiments conducted on the OpenSinger dataset demonstrate that UniSinger achieves state-of-the-art results in all three applications. Further extensive experiments verify the capability of representation disentanglement and information matching, showing that UniSinger excels in sample quality, timbre similarity, and multi-task compatibility. Audio samples can be found at https://unisinger.github.io/Samples/.