Abstract: In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps: data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model that automatically extracts the duration of each phoneme in the lyrics, proceeding from the coarse-grained sentence level to the fine-grained phoneme level. We further design a multi-lingual multi-singer singing model based on a feed-forward Transformer that directly generates linear spectrograms from lyrics, and we synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites; 2) the lyrics-to-singing alignment model avoids any human effort for alignment labeling and greatly reduces labeling cost; 3) the singing model based on a feed-forward Transformer is simple and efficient, because it removes the complicated acoustic feature modeling of parametric synthesis and leverages a reference encoder to capture the timbre of a singer from noisy singing data; and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that, with singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. Audio samples are available at https://speechresearch.github.io/deepsinger/.
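To make the notion of "extracting the duration of each phoneme" concrete, the following is a minimal sketch and not the paper's alignment model. It assumes an aligner has already assigned each spectrogram frame to one phoneme index in a monotonic order; the function name `frames_per_phoneme`, the example phonemes, and the 12.5 ms hop size are all hypothetical choices made for illustration.

```python
def frames_per_phoneme(frame_to_phoneme, num_phonemes):
    """Count how many frames are assigned to each phoneme index.

    frame_to_phoneme: one phoneme index per frame, assumed to be
    non-decreasing (a monotonic alignment). This input format is an
    assumption for illustration, not the paper's interface.
    """
    counts = [0] * num_phonemes
    previous = 0
    for index in frame_to_phoneme:
        if index < previous:
            raise ValueError("alignment must be monotonic")
        counts[index] += 1
        previous = index
    return counts


# Hypothetical example: four phonemes aligned to eight frames.
phonemes = ["n", "i", "h", "ao"]
frame_to_phoneme = [0, 0, 1, 1, 1, 2, 3, 3]
counts = frames_per_phoneme(frame_to_phoneme, len(phonemes))

# Convert frame counts to seconds with an assumed hop size of 12.5 ms.
hop_seconds = 0.0125
durations = [count * hop_seconds for count in counts]

for phoneme, count, seconds in zip(phonemes, counts, durations):
    print(f"{phoneme}: {count} frames, {seconds:.4f} s")
```

The frame counts always sum to the number of frames, so durations obtained this way cover the whole audio segment without gaps or overlaps.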

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy

Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

Constructing a Singing Style Caption Dataset

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis

A Preliminary Investigation on Flexible Singing Voice Synthesis Through Decomposed Framework with Inferrable Features

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Robust Singing Voice Transcription Serves Synthesis

A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

A Systematic Exploration of Joint-training for Singing Voice Synthesis

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models