Abstract: We are interested in a challenging task, Realistic-Music-Score based Singing Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types (grace, slur, rest, etc.). Though significant progress has been achieved, recent singing voice synthesis (SVS) methods are limited to fine-grained music scores, which require a complicated data collection pipeline with time-consuming manual annotation to align music notes with phonemes. Furthermore, this manual annotation destroys the regularity of note durations in music scores, making fine-grained music scores inconvenient for composing. To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconvenience. Because music scores are based on words rather than phonemes, RMSSinger introduces word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment. Furthermore, we propose the first diffusion-based pitch modeling method, which improves on the naturalness of existing pitch-modeling methods. To achieve these, we collect a new dataset containing realistic music scores and singing voices performed by professional singers according to these scores. Extensive experiments on the dataset demonstrate the effectiveness of our methods. Audio samples are available at https://rmssinger.github.io/.

What problem does this paper attempt to address?

The problem this paper addresses is that existing Singing Voice Synthesis (SVS) methods rely on fine-grained music scores, which require a complex and time-consuming manual annotation process covering phonemes, notes, and silences. These manual annotations not only take a great deal of time but also disrupt the regularity of note durations in the music score, making fine-grained music scores inconvenient for composing. To solve these problems, the paper proposes RMSSinger, a new singing voice synthesis method based on realistic music scores. RMSSinger takes realistic music scores containing different note types (grace notes, slurs, rests, etc.) directly as input and generates high-quality singing voices, thereby removing most of the cumbersome manual annotation work while keeping the music score natural and easy to use.

### Specific problems:

1. **Limitations of existing SVS methods**:
   - They require fine-grained music scores, which demand a complex annotation process.
   - Manually adjusting note durations disrupts the regularity of the music score and makes composing less convenient.
2. **Challenges in data collection**:
   - The data collection pipeline for fine-grained music scores is complex and requires a large amount of manual annotation by professionals.
   - The annotation steps, especially phoneme and note annotation, are very time-consuming and demand specialist expertise.
3. **Deficiencies in pitch modeling**:
   - Existing methods mainly use simple L1 or L2 losses for pitch modeling, which limits expressiveness.

### RMSSinger's solutions:

- **SVS based on realistic music scores**: RMSSinger takes realistic music scores directly as input, avoiding cumbersome manual annotation.
- **Word-level modeling**: It introduces a word-level positional attention mechanism and a word-level learned Gaussian upsampler to avoid phoneme duration annotation and phoneme-level note alignment.
- **Diffusion-based pitch modeling**: It proposes the first diffusion-based pitch generation method (P-DDPM), which handles continuous F0 and the discrete unvoiced/voiced (UV) flag together, improving the naturalness and expressiveness of pitch modeling.

Through these innovations, RMSSinger both simplifies the data collection process and improves the quality and naturalness of singing voice synthesis.
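To make the pitch-diffusion idea above more concrete, here is a minimal sketch of the forward (noising) process only. It is not RMSSinger's implementation. It assumes a Gaussian noising step for the continuous F0 contour and a categorical step that mixes the UV flag toward a uniform distribution, which is a common pairing for continuous and discrete variables; the summary above only states that both are handled together. The function names, the linear beta schedule, and the sample values are all hypothetical.

```python
import math
import random


def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=0.06):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_f0(f0, a, rng):
    """Gaussian forward step per frame: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in f0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(f0, eps)]
    return x_t, eps


def noise_uv(uv, a, rng, num_classes=2):
    """Categorical forward step: weight a on the true flag, the rest spread uniformly."""
    noisy, probs = [], []
    for flag in uv:
        p = [(1.0 - a) / num_classes] * num_classes
        p[flag] += a
        probs.append(p)
        noisy.append(rng.choices(range(num_classes), weights=p)[0])
    return noisy, probs


rng = random.Random(0)
alpha_bar = make_alpha_bar()
f0 = [0.1, -0.3, 0.5, 0.0]  # illustrative normalized F0 values, one per frame
uv = [0, 1, 0, 1]           # illustrative per-frame UV flags (two classes)
f0_t, eps = noise_f0(f0, alpha_bar[10], rng)
uv_t, probs = noise_uv(uv, alpha_bar[10], rng)
```

A denoising network would then be trained to reverse both processes jointly, for example by predicting `eps` for F0 and the clean flag for UV. That training objective is outside this sketch.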

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
