Abstract: Singing Voice Synthesis (SVS) aims to generate singing voices of high fidelity and expressiveness. Conventional SVS systems usually use an acoustic model to transform a music score into acoustic features, followed by a vocoder to reconstruct the singing voice. It was recently shown that end-to-end modeling is effective in the fields of SVS and Text-to-Speech (TTS). In this work, we thus present a fully end-to-end SVS method together with chunkwise streaming inference to address the latency issue in practical use. Note that this is the first attempt to fully implement end-to-end streaming audio synthesis using latent representations in a VAE. We have made specific improvements to enhance the performance of streaming SVS using latent representations. Experimental results demonstrate that the proposed method achieves synthesized audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.

### What problem does this paper attempt to address?

This paper aims to solve two main problems in Singing Voice Synthesis (SVS):

1. **High-fidelity and expressive singing voice generation**: Traditional SVS systems usually adopt a two-stage approach: an acoustic model first converts the musical score into acoustic features, and a vocoder then reconstructs the singing voice. This approach carries a significant computational burden on long sequences and has poor real-time performance. In contrast, end-to-end modeling has been shown to be effective in both SVS and Text-to-Speech (TTS).
2. **Low-latency real-time audio synthesis**: Existing high-performance models run into limits on computational resources and latency in practical applications, especially on edge devices or in online network services. A method that can perform audio synthesis in a streaming, autoregressive manner is therefore needed.

### Specific solutions

To address these challenges, the authors propose a new end-to-end chunk-streaming singing voice synthesis system named **ChunkStreamSinger (CSSinger)**. The system is based on the Conditional Variational Autoencoder (CVAE) and has the following innovations:

- **First realization of end-to-end streaming audio synthesis**: Latent representations are used for streaming audio synthesis, which addresses the latency problem of traditional methods.
- **Better input handling for the causal streaming vocoder**: A Natural Padding strategy is introduced to avoid the audio quality degradation caused by constant padding.
- **A fully streaming acoustic model decoder**: The chunk-streaming acoustic model decoder captures frame-level acoustic features and applies the streaming paradigm throughout the system, thereby avoiding the quadratic time complexity and computational cost of the attention mechanism.

### Main contributions

- **First attempt to implement streaming SVS within the CVAE framework**: The system generates chunks sequentially while allowing parallel computation within each chunk.
- **Latent representations as input to the causal streaming vocoder**: The natural padding strategy significantly improves audio quality in this setting.
- **A fully streaming acoustic model decoder**: The final model is comparable to the parallel baseline system in subjective evaluation while having the lowest latency.

With these improvements, CSSinger outperforms or matches the parallel baseline system on both subjective and objective metrics on two Chinese singing voice datasets and one TTS dataset, while significantly reducing latency.
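The effect of padding on a causal streaming vocoder can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a single 1-D causal convolution in plain Python, and it assumes that "natural padding" means filling the left context of each chunk with the real frames that precede it, rather than with a constant. The function names (`causal_conv`, `stream`) and the kernel values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def causal_conv(frames: Sequence[float], kernel: Sequence[float],
                left_context: Sequence[float]) -> List[float]:
    """1-D causal convolution: output[t] depends only on frames up to t.

    left_context holds the len(kernel) - 1 values that precede frames[0].
    """
    k = len(kernel)
    assert len(left_context) == k - 1
    padded = list(left_context) + list(frames)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]


def stream(frames: Sequence[float], kernel: Sequence[float],
           chunk_size: int, natural_padding: bool) -> List[float]:
    """Process frames chunk by chunk, as a streaming system would.

    natural_padding=True  (assumed reading): the left context of each chunk
                          is the tail of the real frames seen so far.
    natural_padding=False: every chunk is padded with constant zeros.
    """
    k = len(kernel)
    context = [0.0] * (k - 1)  # nothing precedes the very first chunk
    out: List[float] = []
    for start in range(0, len(frames), chunk_size):
        chunk = list(frames[start:start + chunk_size])
        out.extend(causal_conv(chunk, kernel, context))
        if natural_padding:
            history = context + chunk
            # keep the last k - 1 real values as context for the next chunk
            context = history[len(history) - (k - 1):]
    return out


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.5, 0.25, 0.25]
    full = causal_conv(frames, kernel, [0.0, 0.0])  # whole sequence at once
    print("full    :", full)
    print("natural :", stream(frames, kernel, 3, natural_padding=True))
    print("constant:", stream(frames, kernel, 3, natural_padding=False))
```

In this toy setting, carrying real context across chunk boundaries reproduces the full-sequence output exactly, whereas constant padding changes the first frames of every chunk after the first. That boundary mismatch is one plausible source of the quality degradation the summary attributes to constant padding.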

CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder
