CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder

Jianwei Cui, Yu Gu, Shihao Chen, Jie Zhang, Liping Chen, Lirong Dai
2024-12-13
Abstract: Singing Voice Synthesis (SVS) aims to generate singing voices of high fidelity and expressiveness. Conventional SVS systems usually utilize an acoustic model to transform a music score into acoustic features, followed by a vocoder to reconstruct the singing voice. It was recently shown that end-to-end modeling is effective in the fields of SVS and Text to Speech (TTS). In this work, we thus present a fully end-to-end SVS method together with a chunkwise streaming inference to address the latency issue for practical usages. Note that this is the first attempt to fully implement end-to-end streaming audio synthesis using latent representations in VAE. We have made specific improvements to enhance the performance of streaming SVS using latent representations. Experimental results demonstrate that the proposed method achieves synthesized audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to solve two main problems in Singing Voice Synthesis (SVS):

1. **High-fidelity and expressive singing voice generation**: Traditional SVS systems usually adopt a two-stage approach: an acoustic model first converts the musical score into acoustic features, and a vocoder then reconstructs the singing voice. This approach carries a significant computational burden on long sequences and has poor real-time performance. End-to-end modeling, by contrast, has been shown to be effective in both SVS and Text to Speech (TTS).
2. **Low-latency real-time audio synthesis**: Existing high-performance models run into limits on computational resources and latency in practical applications, especially on edge devices or in online network services. A method that can synthesize audio in a streaming, autoregressive manner is therefore needed to meet the requirements of these scenarios.

### Specific solutions

To address these challenges, the authors propose a new end-to-end chunkwise streaming singing voice synthesis system named **ChunkStreamSinger (CSSinger)**. The system is based on the Conditional Variational Autoencoder (CVAE) and has the following innovations:

- **End-to-end streaming audio synthesis from latent representations**: The authors describe this as the first attempt to fully implement end-to-end streaming audio synthesis using VAE latent representations, which addresses the latency problem of traditional methods.
- **Better input handling for the causal streaming vocoder**: A Natural Padding strategy is introduced to avoid the audio quality degradation caused by constant padding.
- **A fully streaming acoustic model decoder**: A chunkwise streaming acoustic model decoder captures frame-level acoustic features and applies the streaming paradigm throughout the system, thereby avoiding the quadratic time complexity and computational cost of the attention mechanism.

### Main contributions

- **First attempt to implement streaming SVS within the CVAE framework**: The system generates chunks sequentially while allowing parallel computation within each chunk.
- **Latent representations as input to the causal streaming vocoder**: The natural padding strategy significantly improves audio quality in this setting.
- **A fully streaming acoustic model decoder**: The final model is comparable to the parallel baseline system in subjective evaluation while having the lowest latency.

With these improvements, CSSinger matches or outperforms the parallel baseline system in both subjective and objective metrics on two Chinese singing voice datasets and one TTS dataset, while significantly reducing latency.
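The summary does not spell out how Natural Padding works, so the following is only a minimal sketch under one assumed reading: instead of padding the left edge of each chunk with a constant (zeros), the causal layer is fed the real frames that preceded the chunk. The function names `causal_conv` and `stream_chunks`, the single-layer 1-D convolution, and the toy kernel are all hypothetical and are not taken from the paper. The sketch shows why padding choice matters for chunkwise streaming: carrying over real context reproduces the full-sequence output exactly, while constant padding changes the first frames of every chunk.

```python
from typing import List, Sequence


def causal_conv(frames: Sequence[float],
                kernel: Sequence[float],
                left_context: Sequence[float]) -> List[float]:
    """Causal 1-D convolution (hypothetical helper, not from the paper).

    Output frame t depends only on frames t-(K-1) .. t, where K is the
    kernel length. `left_context` supplies the K-1 values before frames[0].
    """
    k = len(kernel)
    assert len(left_context) == k - 1
    padded = list(left_context) + list(frames)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]


def stream_chunks(frames: Sequence[float],
                  kernel: Sequence[float],
                  chunk_size: int,
                  natural_padding: bool) -> List[float]:
    """Process `frames` chunk by chunk, as a streaming system would.

    natural_padding=True  : assumed reading of "Natural Padding"; each chunk
                            is left-padded with the real preceding frames.
    natural_padding=False : constant padding; each chunk is left-padded
                            with zeros, discarding the true history.
    """
    k = len(kernel)
    context = [0.0] * (k - 1)  # nothing precedes the very first chunk
    output: List[float] = []
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        output.extend(causal_conv(chunk, kernel, context))
        if natural_padding:
            # Keep the last K-1 real frames as context for the next chunk.
            history = context + chunk
            context = history[len(history) - (k - 1):]
    return output


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    kernel = [0.25, 0.5, 0.25]
    full = causal_conv(frames, kernel, [0.0, 0.0])  # non-streaming reference
    print("full sequence   :", full)
    print("natural padding :", stream_chunks(frames, kernel, 2, True))
    print("constant padding:", stream_chunks(frames, kernel, 2, False))
```

In this toy setting the chunks are still produced one after another, but the work inside each chunk is independent per frame, which mirrors the "sequential across chunks, parallel within a chunk" behavior described above. A real vocoder stacks many such layers, so each layer would need its own carried-over context.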