Automatic Estimation of Singing Voice Musical Dynamics

Jyoti Narang, Nazif Can Tamer, Viviana De La Vega, Xavier Serra
2024-10-28
Abstract: Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of romantic-era compositions, widely known for its wealth of expressive annotations. Utilizing the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explored two distinct perceptually motivated input representations for the model training: log-Mel spectrum and bark-scale based features. For testing, we manually curate another dataset of 25 musical dynamics annotated performances in collaboration with a professional vocalist. We conclude through our experiments that bark-scale based features outperform log-Mel features for the task of singing voice dynamics prediction. The dataset along with the code is shared publicly for further research on the topic.
Sound, Information Retrieval, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the automatic analysis and estimation of musical dynamics in singing voice performances. Three obstacles make this problem important and difficult:

1. **Scarcity of datasets**: Very few existing datasets are suitable for analyzing musical dynamics in singing voice, which has limited research progress in the area.
2. **Lack of a clear evaluation framework**: There are no clear, unified evaluation criteria for automatic analysis systems, so it is difficult to compare the effectiveness of different methods.
3. **Subjectivity of musical dynamics annotation**: Annotating musical dynamics depends heavily on the listener's judgment. Different listeners may interpret the same piece of music differently.

To address these problems, the authors carry out the following key tasks:

- **Dataset construction**: Using state-of-the-art source separation and alignment techniques, compile 509 singing voice performances annotated with musical dynamics and align them with 163 score files sourced from the OpenScore Lieder corpus.
- **Model training and evaluation**: On the curated dataset, train a convolutional neural network (CNN) with a multi-head attention mechanism, using varying window sizes and two perceptually motivated input representations (log-Mel spectrum and Bark-scale features), to evaluate how effectively musical dynamics can be estimated.
- **Test dataset**: To further verify the model's performance, manually curate a separate dataset of 25 performances annotated with musical dynamics, in collaboration with a professional vocalist.
The experimental results show that Bark-scale features outperform log-Mel features for singing voice dynamics prediction. The model captures changes in musical dynamics better with larger time windows.

### Key formulas and symbol explanations

- **Log-Mel spectrum**:
\[ \text{log-Mel}(t, f)=\log\left(1 + 1000\times\frac{\text{Mel}(f)}{f_{\text{max}}}\right) \]
where \( t \) represents time, \( f \) represents frequency, and \( f_{\text{max}} \) is the maximum frequency.
- **Bark-scale feature**:
\[ \text{Bark}(f)=13\arctan(0.00076f)+3.5\arctan\left((f / 7500)^2\right) \]
where \( f \) represents frequency in Hz.

Through these tasks, the authors address a gap in the automatic analysis of musical dynamics in singing voice and provide a dataset, code, and methodology for future research.
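
The Bark-scale formula quoted above can be evaluated directly. The following is a minimal sketch of that Hz-to-Bark mapping only, not the authors' feature extraction code; the function name `hz_to_bark` and the sample frequencies are illustrative assumptions.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Map a frequency in Hz to the Bark scale using the formula quoted above.

    Hypothetical helper for illustration; the name is not from the paper.
    """
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


if __name__ == "__main__":
    # Arbitrary sample frequencies chosen for illustration.
    for f in (100, 500, 1000, 4000, 16000):
        print(f"{f:>6} Hz -> {hz_to_bark(f):.2f} Bark")
```

The mapping is close to linear at low frequencies and compresses high frequencies, which is why Bark-spaced bands give finer resolution in the lower part of the spectrum.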