Abstract: Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention, partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 singing voice performances annotated with musical dynamics, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of Romantic-era compositions, widely known for its wealth of expressive annotations. Using the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explore two distinct perceptually motivated input representations for model training: log-Mel spectrum and Bark-scale based features. For testing, we manually curate another dataset of 25 performances annotated with musical dynamics, in collaboration with a professional vocalist. We conclude from our experiments that Bark-scale based features outperform log-Mel features for the task of singing voice dynamics prediction. The dataset and the code are shared publicly for further research on the topic.

What problem does this paper attempt to address?

This paper addresses the automatic analysis and estimation of musical dynamics in singing voice performances. It identifies the following obstacles:

1. **Scarcity of datasets**: Few existing datasets are suitable for analyzing musical dynamics in singing voice, which limits research progress.
2. **Lack of a clear evaluation framework**: There are no clear, unified criteria for measuring the performance of automatic analysis systems, which makes it difficult to compare methods.
3. **Subjectivity of musical dynamics annotation**: Annotating musical dynamics depends heavily on the listener's judgment. Different listeners may interpret the same piece differently.

To address these problems, the authors carry out the following key tasks:

- **Dataset construction**: Using state-of-the-art source separation and alignment techniques, compile 509 singing voice performances annotated with musical dynamics and align them with 163 score files sourced from the OpenScore Lieder corpus.
- **Model training and evaluation**: On the curated dataset, train a convolutional neural network (CNN) with a multi-head attention mechanism, using varying window sizes and two perceptually motivated input representations (log-Mel spectrum and Bark-scale features), to evaluate how well musical dynamics can be estimated.
- **Test dataset**: Manually curate a separate test set of 25 performances annotated with musical dynamics, in collaboration with a professional vocalist.

The experimental results show that Bark-scale features outperform log-Mel features for singing voice dynamics prediction. The model captures changes in musical dynamics better with larger time windows.

### Key formulas and symbol explanations

- **Log-Mel spectrum**:
  \[ \text{log-Mel}(t, f) = \log\left(1 + 1000 \times \frac{\text{Mel}(f)}{f_{\text{max}}}\right) \]
  where \( t \) represents time, \( f \) represents frequency, and \( f_{\text{max}} \) is the maximum frequency.
- **Bark-scale feature**:
  \[ \text{Bark}(f) = 13\arctan(0.00076f) + 3.5\arctan\left((f / 7500)^2\right) \]
  where \( f \) represents frequency in Hz.

Through these tasks, the authors address a gap in the automatic analysis of musical dynamics in singing voice and provide a dataset, code, and a methodology for future research.
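The Bark-scale mapping quoted above can be illustrated with a minimal Python sketch. It only evaluates the stated formula and is not the authors' feature extraction pipeline. The function names `hz_to_bark` and `bark_to_hz`, the bisection-based inverse, and the 20 kHz search bound are assumptions introduced here for illustration.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Map a frequency in Hz to the Bark scale using the formula quoted above."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bark_to_hz(z: float, f_hi: float = 20000.0, tol: float = 1e-6) -> float:
    """Hypothetical helper: invert hz_to_bark numerically by bisection.

    The mapping is monotonically increasing for f >= 0, so bisection on
    the interval [0, f_hi] converges to the frequency whose Bark value is z.
    The upper bound f_hi is an assumed search limit, not a value from the paper.
    """
    lo, hi = 0.0, f_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hz_to_bark(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Print the Bark value for a few example frequencies.
    for f in (100.0, 1000.0, 4000.0, 8000.0):
        print(f"{f:8.1f} Hz -> {hz_to_bark(f):.3f} Bark")
```

The inverse is useful when placing band edges at equal spacing on the Bark scale, which is one common way to build Bark-band features. Whether the paper constructs its bands this way is not stated in this summary.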

Automatic Estimation of Singing Voice Musical Dynamics
