Joint Learning of Emotions in Music and Generalized Sounds

Federico Simonetta, Francesca Certo, Stavros Ntalampiras
DOI: https://doi.org/10.1145/3678299.3678328
2024-08-14
Abstract: In this study, we aim to determine if generalized sounds and music can share a common emotional space, improving predictions of emotion in terms of arousal and valence. We propose the use of multiple datasets as a multi-domain learning technique. Our approach involves creating a common space encompassing features that characterize both generalized sounds and music, as they can evoke emotions in a similar manner. To achieve this, we utilized two publicly available datasets, namely IADS-E and PMEmo, following a standardized experimental protocol. We employed a wide variety of features that capture diverse aspects of the audio structure including key parameters of spectrum, energy, and voicing. Subsequently, we performed joint learning on the common feature space, leveraging heterogeneous model architectures. Interestingly, this synergistic scheme outperforms the state of the art in both sound and music emotion prediction. The code enabling full replication of the presented experimental pipeline is available at https://github.com/LIMUNIMI/MusicSoundEmotions.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper asks **whether generalized sounds and music can share a common emotional space, and whether sharing it improves emotion prediction along the arousal and valence dimensions**. Specifically, the authors test two hypotheses:

1. **Generalized sounds and music can be modeled in a common emotional space**: a single feature space characterizes both kinds of audio, so one model can be trained on both jointly.
2. **Multi-domain learning improves emotion prediction**: training on both domains together predicts arousal and valence better than existing methods.

To test these hypotheses, the authors use two publicly available datasets, IADS-E and PMEmo, and perform joint learning on audio features extracted from both. The features cover spectrum, energy, and voicing, and the experiments use heterogeneous model architectures (linear models, support vector regression, and AutoML). The joint scheme outperforms the state of the art in both sound and music emotion prediction, with the most notable improvement in arousal prediction.

### Key Contributions

1. **A multi-domain learning strategy** that combines different types of audio data (music and generalized sounds) for audio emotion recognition (AER).
2. **New models** that surpass existing techniques in emotion recognition for music and environmental sounds.
3. **A detailed analysis of how the proposed strategy affects different types of sounds**, indicating its potential in practical applications.

### Method Overview

- **Datasets**: IADS-E (generalized sounds) and PMEmo (music).
- **Feature Extraction**: The openSMILE toolkit is used to extract 6375 static features covering aspects such as spectrum, energy, and voicing.
- **Model Selection and Validation**: ElasticNet, support vector regression (SVR), and AutoML models are evaluated with 5-fold cross-validation.

With this setup, the authors show that generalized sounds and music can be modeled effectively in a common emotional space, and that multi-domain learning improves the accuracy of emotion prediction.
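
The joint-learning protocol summarized above can be sketched in a few lines of Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline (which is in the linked repository). The data here is synthetic, the helper names `make_synthetic_domain` and `joint_cv_rmse` are hypothetical, the hyperparameters are arbitrary, and the choice to add all auxiliary-domain samples to each training fold is an assumption about how the two datasets might be combined. Only ElasticNet, SVR, and 5-fold cross-validation come from the summary.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_synthetic_domain(n_samples, n_features, shift, rng):
    """Hypothetical stand-in for one dataset: features X and one target y
    (for example arousal). Real features would come from openSMILE."""
    X = rng.normal(loc=shift, scale=1.0, size=(n_samples, n_features))
    w = np.linspace(-1.0, 1.0, n_features)  # mapping shared by both domains
    y = 0.2 * (X @ w) + rng.normal(scale=0.1, size=n_samples)
    return X, y


def joint_cv_rmse(model, X_target, y_target, X_aux, y_aux, n_splits=5, seed=0):
    """K-fold CV on the target domain. Each training fold is augmented with
    all auxiliary-domain samples (an assumption); test folds hold target
    samples only. Returns the mean RMSE over folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kfold.split(X_target):
        X_train = np.vstack([X_target[train_idx], X_aux])
        y_train = np.concatenate([y_target[train_idx], y_aux])
        # Scale on the training data only, then fit a fresh copy of the model.
        pipeline = make_pipeline(StandardScaler(), clone(model))
        pipeline.fit(X_train, y_train)
        pred = pipeline.predict(X_target[test_idx])
        scores.append(np.sqrt(mean_squared_error(y_target[test_idx], pred)))
    return float(np.mean(scores))


rng = np.random.default_rng(0)
X_music, y_music = make_synthetic_domain(120, 20, shift=0.0, rng=rng)
X_sound, y_sound = make_synthetic_domain(100, 20, shift=0.5, rng=rng)

models = {
    "ElasticNet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "SVR": SVR(kernel="rbf", C=1.0),
}

results = {}
for name, model in models.items():
    # Baseline: music only (empty auxiliary set). Joint: music plus sounds.
    single = joint_cv_rmse(model, X_music, y_music, X_sound[:0], y_sound[:0])
    joint = joint_cv_rmse(model, X_music, y_music, X_sound, y_sound)
    results[name] = {"single": single, "joint": joint}
    print(f"{name}: single-domain RMSE={single:.3f}, joint RMSE={joint:.3f}")
```

Swapping the roles of the two arrays evaluates the opposite direction (sounds as target, music as auxiliary). Because the synthetic domains share the same feature-to-target mapping by construction, any gain seen here says nothing about real data; the sketch only shows the shape of the evaluation loop.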