Abstract: In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps: data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model that automatically extracts the duration of each phoneme in the lyrics, working from the coarse-grained sentence level down to the fine-grained phoneme level. We further design a multi-lingual multi-singer singing model based on a feed-forward Transformer that directly generates linear spectrograms from lyrics, and we synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites; 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost; 3) the singing model based on a feed-forward Transformer is simple and efficient, as it removes the complicated acoustic feature modeling of parametric synthesis and leverages a reference encoder to capture the timbre of a singer from noisy singing data; and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese, and English). The results demonstrate that, with singing data mined purely from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. (Footnote: Our audio samples are available at https://speechresearch.github.io/deepsinger/ .)
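As an illustration of how per-phoneme durations from an alignment step could drive a feed-forward model, the sketch below repeats each phoneme for the number of spectrogram frames assigned to it. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `expand_by_duration`, the use of integer frame counts, and the example phonemes are all hypothetical.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme once per frame it is assumed to span.

    `durations` holds hypothetical per-phoneme frame counts, such as an
    alignment model might produce. The result is a frame-level sequence
    whose length equals the sum of all durations.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        # One copy of the phoneme per output frame.
        frames.extend([phoneme] * n_frames)
    return frames


if __name__ == "__main__":
    # Hypothetical example: three phonemes spanning 2, 3, and 1 frames.
    print(expand_by_duration(["n", "i", "h"], [2, 3, 1]))
```

Because the frame-level sequence has a known length up front, a non-autoregressive model can predict all output frames in parallel, which is one reason duration extraction matters for this style of architecture.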

HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models

A Deep Learning Based Analysis-Synthesis Framework For Unison Singing

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis

SingGAN: Generative Adversarial Network for High-Fidelity Singing Voice Generation

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

RealSinger: Ultra-realistic singing voice generation via stochastic differential equations

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Xiaoicesing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis

An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures

Learning Singing From Speech

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings