ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Yulin Song, Guorui Sang, Jing Yu, Chuangbai Xiao
2024-10-20
Abstract: A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from a given music score (lyrics, duration, and pitch). Diffusion models have recently performed well in this field. However, they trade inference speed for sample quality, which limits their application scenarios. To obtain high-quality synthetic singing voice more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained with a consistency constraint, and generation quality improves greatly at the cost of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
Sound, Machine Learning, Audio and Speech Processing