RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui, Huadai Liu, Zhou Zhao
2023-05-18
Abstract: We are interested in a challenging task, Realistic-Music-Score based Singing Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types (grace, slur, rest, etc.). Though significant progress has been achieved, recent singing voice synthesis (SVS) methods are limited to fine-grained music scores, which require a complicated data collection pipeline with time-consuming manual annotation to align music notes with phonemes. Furthermore, this manual annotation destroys the regularity of note durations in music scores, making fine-grained music scores inconvenient for composing. To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconvenience. Because music scores are based on words rather than phonemes, RMSSinger introduces word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment. Furthermore, we propose the first diffusion-based pitch modeling method, which improves on the naturalness of existing pitch-modeling methods. To support this work, we collect a new dataset containing realistic music scores and singing voices recorded by professional singers according to those scores. Extensive experiments on the dataset demonstrate the effectiveness of our methods. Audio samples are available at https://rmssinger.github.io/.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the reliance of existing Singing Voice Synthesis (SVS) methods on fine-grained music scores. Producing such scores requires a complex, time-consuming manual annotation process that covers phoneme annotation, note annotation, and silence annotation. Besides costing a great deal of time, these annotations disrupt the regularity of note durations in the music score, which makes fine-grained scores inconvenient to use when composing. To solve these problems, the paper proposes RMSSinger, a new singing voice synthesis method based on realistic music scores. RMSSinger takes realistic music scores containing different note types (such as grace notes, slurs, and rests) directly as input and generates high-quality singing voices, removing most of the cumbersome manual annotation work while keeping the music score natural and easy to use.

### Specific problems:

1. **Limitations of existing SVS methods**:
   - They require fine-grained music scores, which demand a complex annotation process.
   - Manually adjusting note durations disrupts the regularity of the music score and makes composing less convenient.
2. **Challenges in data collection**:
   - The data collection pipeline for fine-grained music scores is complex and requires a large amount of professional manual annotation.
   - The annotation steps, especially phoneme and note annotation, are time-consuming and require expert knowledge.
3. **Deficiencies in pitch modeling**:
   - Existing methods mainly use simple L1 or L2 losses for pitch modeling, which limits expressiveness.

### RMSSinger's solutions:

- **SVS based on realistic music scores**: RMSSinger takes realistic music scores directly as input, avoiding cumbersome manual annotation.
- **Word-level modeling**: It introduces a word-level positional attention mechanism and a word-level learned Gaussian upsampler to avoid phoneme duration annotation and phoneme-level note alignment.
- **Diffusion-based pitch modeling**: It proposes the first diffusion-based pitch generation method (P-DDPM), which models continuous F0 and discrete UV (unvoiced) flags simultaneously, improving the naturalness and expressiveness of pitch modeling.

Through these innovations, RMSSinger both simplifies the data collection process and improves the quality and naturalness of singing voice synthesis.
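To make the idea of a Gaussian upsampler more concrete, the following is a minimal sketch of generic Gaussian upsampling, which expands token-level vectors to frame level using soft Gaussian weights instead of hard repetition. It is not the paper's word-level module: the function name `gaussian_upsample`, the toy vectors, and the hand-picked durations and widths are assumptions for illustration. In a learned upsampler, the durations and widths would be predicted by a network.

```python
import math

def gaussian_upsample(token_vectors, durations, sigmas):
    """Expand token-level vectors to frame level with Gaussian weights.

    Token i is centred at c_i = (cumulative duration up to and including i) - d_i / 2.
    Frame t mixes all tokens with weights proportional to the Gaussian
    density N(t; c_i, sigma_i^2), normalized to sum to 1 over tokens.
    """
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    num_frames = int(round(end))
    dim = len(token_vectors[0])
    frames = []
    for t in range(num_frames):
        pos = t + 0.5  # evaluate at the middle of frame t
        # Log-density up to a constant shared by all tokens.
        logits = [
            -((pos - c) ** 2) / (2.0 * s * s) - math.log(s)
            for c, s in zip(centers, sigmas)
        ]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        frames.append([
            sum(w * vec[k] for w, vec in zip(weights, token_vectors))
            for k in range(dim)
        ])
    return frames

# Two hypothetical phoneme vectors inside one word, lasting 3 and 2 frames.
frames = gaussian_upsample(
    token_vectors=[[1.0, 0.0], [0.0, 1.0]],
    durations=[3.0, 2.0],
    sigmas=[1.0, 1.0],
)
```

Because the weights are soft, frames near a token boundary blend the two neighbouring tokens, so the boundary does not need to be annotated exactly.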
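As a rough illustration of how one diffusion process might treat continuous F0 and discrete UV together, the sketch below applies a Gaussian forward (noising) step to F0 values and a categorical forward step to binary UV flags. It is a minimal sketch and not the paper's P-DDPM: the linear schedule, its endpoint values, the function names, and the sample frames are all assumptions made for illustration, and the learned reverse (denoising) network is omitted.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.06):
    """Hypothetical linear noise schedule; the endpoint values are illustrative."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_f0(f0, alpha_bar_t, rng):
    """Gaussian forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * rng.gauss(0.0, 1.0) for x in f0]

def uv_keep_probability(alpha_bar_t, num_classes=2):
    """Probability that a flag keeps its value under the categorical step
    q(x_t | x_0) = Cat(alpha_bar_t * onehot(x_0) + (1 - alpha_bar_t) / K)."""
    return alpha_bar_t + (1.0 - alpha_bar_t) / num_classes

def noise_uv(uv, alpha_bar_t, rng):
    """Categorical forward step for binary voiced/unvoiced flags."""
    keep = uv_keep_probability(alpha_bar_t)
    return [u if rng.random() < keep else 1 - u for u in uv]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(100))

f0 = [0.1, 0.3, 0.0, -0.2]  # normalized log-F0 per frame (made-up values)
uv = [0, 0, 1, 0]           # 1 marks an unvoiced frame (made-up values)

t = 50
noisy_f0 = noise_f0(f0, alpha_bars[t], rng)
noisy_uv = noise_uv(uv, alpha_bars[t], rng)
```

At small `t` both tracks stay close to the clean input. As `t` grows, F0 approaches standard Gaussian noise and each UV flag approaches a fair coin flip, which is the starting point a reverse model would denoise from.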