A Theory-Based Explainable Deep Learning Architecture for Music Emotion

Hortense Fong, Vineet Kumar, K. Sudhir
2024-08-14
Abstract: This paper develops a theory-based, explainable deep learning convolutional neural network (CNN) classifier to predict the time-varying emotional response to music. We design novel CNN filters that leverage the frequency harmonics structure from acoustic physics known to impact the perception of musical features. Our theory-based model is more parsimonious, but provides comparable predictive performance to atheoretical deep learning models, while performing better than models using handcrafted features. Our model can be complemented with handcrafted features, but the performance improvement is marginal. Importantly, the harmonics-based structure placed on the CNN filters provides better explainability for how the model predicts emotional response (valence and arousal), because emotion is closely related to consonance--a perceptual feature defined by the alignment of harmonics. Finally, we illustrate the utility of our model with an application involving digital advertising. Motivated by YouTube mid-roll ads, we conduct a lab experiment in which we exogenously insert ads at different times within videos. We find that ads placed in emotionally similar contexts increase ad engagement (lower skip rates, higher brand recall rates). Ad insertion based on emotional similarity metrics predicted by our theory-based, explainable model produces comparable or better engagement relative to atheoretical models.
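The abstract ties consonance to the alignment of harmonics. As a rough illustration only, the sketch below computes a toy alignment score between two tones. The function names `harmonics` and `harmonic_alignment`, the choice of eight harmonics, and the 0.5% tolerance are all assumptions made for this example; this is not the paper's filter design or its consonance measure.

```python
def harmonics(f0, n=8):
    """Return the first n harmonics (integer multiples) of a fundamental f0 in Hz."""
    return [f0 * k for k in range(1, n + 1)]


def harmonic_alignment(f1, f2, n=8, tol=0.005):
    """Toy consonance proxy (hypothetical, not the paper's measure).

    Returns the fraction of the first n harmonics of f1 that lie within a
    relative tolerance `tol` of some harmonic of f2.
    """
    h1 = harmonics(f1, n)
    h2 = harmonics(f2, n)
    matched = sum(1 for a in h1 if any(abs(a - b) <= tol * a for b in h2))
    return matched / n


if __name__ == "__main__":
    a3 = 220.0
    octave = harmonic_alignment(a3, 2 * a3)           # frequency ratio 2:1
    fifth = harmonic_alignment(a3, 1.5 * a3)          # frequency ratio 3:2
    tritone = harmonic_alignment(a3, a3 * 2 ** 0.5)   # equal-tempered tritone
    print(octave, fifth, tritone)
```

Under these assumptions, intervals usually heard as consonant (octave, perfect fifth) share more harmonics than a dissonant one (tritone), which matches the intuition the abstract appeals to.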
Sound, Artificial Intelligence, Human-Computer Interaction, Audio and Speech Processing
What problem does this paper attempt to address?
The core problem this paper addresses is how to build a theory-based, interpretable deep-learning convolutional neural network (CNN) classifier that predicts the emotional responses evoked by music over time. Specifically, the research aims to make emotion prediction more interpretable without sacrificing accuracy, by building theory about harmonic structure into the deep-learning architecture.

### Main Problem Decomposition

1. **Accuracy of Emotional Prediction**:
   - Existing music emotion classifiers generally fall into two categories: traditional machine-learning methods using handcrafted features (more interpretable but less accurate), and theory-free deep-learning methods using data-driven features (more accurate but difficult to interpret). This paper proposes a framework that aims to improve interpretability while maintaining high accuracy.
2. **Interpretability of the Model**:
   - To make the model more interpretable, the authors introduce CNN filters based on harmonic structure. These filters capture musical features closely related to emotion, such as consonance. The authors also use visualization tools such as Grad-CAM to show how the model makes emotion predictions from musical features.
3. **Value in Practical Applications**:
   - The research demonstrates the model's value in digital advertising, specifically YouTube-style mid-roll ads. In a lab experiment, placing ads in emotionally similar video contexts increases ad engagement (lower skip rates and higher brand recall rates).

### Key Contributions

1. **Combination of Theoretical Basis and Interpretability**:
   - The authors design CNN filters based on harmonic structure, so the model not only captures key features in music but also helps explain how those features affect the emotion prediction.
2. **Balance between Performance and Simplicity**:
   - Although the model is more parsimonious, its predictive performance is comparable to that of theory-free deep-learning models and better than that of models using handcrafted features.
3. **Verification of Practical Application Scenarios**:
   - A lab experiment tests an emotion-matching strategy for ad insertion. Insertion based on the model's predicted emotional similarity produces engagement comparable to or better than insertion based on theory-free models, which illustrates the model's practical value.

### Summary

This paper develops an accurate and interpretable music emotion prediction model by combining harmonics-based theory with deep-learning techniques, and shows its potential for practical application in digital advertising. The method keeps predictive accuracy comparable to theory-free deep-learning models while making the model's predictions more transparent.
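The emotion-matching idea behind the ad-insertion application can be sketched in a few lines. This is a minimal illustration, assuming each video segment and the ad are each summarized by a predicted (valence, arousal) pair and that similarity is measured by Euclidean distance. The function name `best_insertion_index`, the distance metric, and the sample values are hypothetical; the source does not specify the similarity metric the authors use.

```python
import math


def best_insertion_index(video_emotions, ad_emotion):
    """Return the index of the video segment emotionally closest to the ad.

    video_emotions: list of (valence, arousal) pairs, one per time segment.
    ad_emotion: a single (valence, arousal) pair for the ad.
    Closeness is Euclidean distance, an assumption made for this sketch.
    """
    if not video_emotions:
        raise ValueError("video_emotions must contain at least one segment")
    distances = [math.dist(segment, ad_emotion) for segment in video_emotions]
    return min(range(len(distances)), key=distances.__getitem__)


if __name__ == "__main__":
    # Made-up predictions for three segments of a video, plus one ad.
    video = [(0.2, 0.8), (-0.5, 0.1), (0.6, 0.4)]
    ad = (0.5, 0.5)
    print(best_insertion_index(video, ad))
```

In practice the per-segment predictions would come from an emotion classifier such as the one described above, and a real system would also have to respect constraints on where mid-roll ads may be placed.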