Abstract: Deep generative models have made significant progress in speech synthesis, but high-fidelity singing voice synthesis remains an open problem because of its long continuous pronunciation, rich high-frequency content, and strong expressiveness. Existing neural vocoders designed for text-to-speech cannot be applied directly to singing voice synthesis because they produce glitches and poor high-frequency reconstruction. In this work, we propose SingGAN, a generative adversarial network designed for high-fidelity singing voice synthesis. Specifically, 1) to alleviate the glitch problem in the generated samples, we propose source excitation with adaptive feature learning filters to expand the receptive field patterns and stabilize long continuous signal generation; 2) SingGAN introduces global and local discriminators at different scales to enrich low-frequency details and promote high-frequency reconstruction; and 3) to improve training efficiency, SingGAN includes auxiliary spectrogram losses and a sub-band feature matching penalty loss. To the best of our knowledge, SingGAN is the first work designed for high-fidelity singing voice vocoding. Our evaluation shows that SingGAN achieves state-of-the-art results with higher-quality samples (MOS 4.05). SingGAN also achieves a sampling speed 50x faster than real time on a single NVIDIA 2080Ti GPU. We further show that SingGAN generalizes well to mel-spectrogram inversion for unseen singers, and that the end-to-end singing voice synthesis system SingGAN-SVS uses a two-stage pipeline to transform music scores into expressive singing voices.
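
To make the "50x faster than real time" figure concrete, the sketch below computes a real-time speedup as audio duration divided by synthesis time. It is only an illustration of the arithmetic: the function names, the 24 kHz sample rate, and the timing values are assumptions for this example and are not taken from the paper.

```python
def real_time_speedup(num_samples: int, sample_rate: int, synthesis_seconds: float) -> float:
    """Return how many times faster than real time a waveform was generated.

    All arguments are hypothetical measurements supplied by the caller.
    """
    if sample_rate <= 0 or synthesis_seconds <= 0:
        raise ValueError("sample_rate and synthesis_seconds must be positive")
    audio_seconds = num_samples / sample_rate  # duration of the generated audio
    return audio_seconds / synthesis_seconds


def synthesis_budget(audio_seconds: float, speedup: float) -> float:
    """Return the wall-clock time available to generate audio at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # Assumed example: 10 s of audio at an assumed 24 kHz, generated in 0.2 s.
    speedup = real_time_speedup(num_samples=240_000, sample_rate=24_000, synthesis_seconds=0.2)
    print(f"speedup: {speedup:.1f}x real time")
    print(f"budget for 10 s of audio at 50x: {synthesis_budget(10.0, 50.0):.2f} s")
```

Under these assumed numbers, a 50x speedup means that ten seconds of singing must be generated in about a fifth of a second.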

Gen-Res-Net: A Novel Generative Model for Singing Voice Separation

Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation

Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation

A Distinct Synthesizer Convolutional Tasnet For Singing Voice Separation

A Practical Singing Voice Detection System Based on GRU-RNN

Hybrid Y-Net Architecture for Singing Voice Separation

Blind Source Separation Based on Improved Wave-U-Net Network

Research On Singing Voice Detection Based On A Long-Term Recurrent Convolutional Network With Vocal Separation And Temporal Smoothing

SingGAN: Generative Adversarial Network for High-Fidelity Singing Voice Generation

Audiovisual Singing Voice Separation

SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

3 directional Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion

Neural Vocoder Feature Estimation for Dry Singing Voice Separation

Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

U-NET: A Supervised Approach for Monaural Source Separation

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Voice and accompaniment separation in music using self-attention convolutional neural network

Unitnet: A Sequence-To-Sequence Acoustic Model For Concatenative Speech Synthesis