Abstract: This paper develops a theory-based, explainable deep learning convolutional neural network (CNN) classifier to predict the time-varying emotional response to music. We design novel CNN filters that leverage the frequency harmonics structure from acoustic physics known to impact the perception of musical features. Our theory-based model is more parsimonious, but provides comparable predictive performance to atheoretical deep learning models, while performing better than models using handcrafted features. Our model can be complemented with handcrafted features, but the performance improvement is marginal. Importantly, the harmonics-based structure placed on the CNN filters provides better explainability for how the model predicts emotional response (valence and arousal), because emotion is closely related to consonance, a perceptual feature defined by the alignment of harmonics. Finally, we illustrate the utility of our model with an application involving digital advertising. Motivated by YouTube mid-roll ads, we conduct a lab experiment in which we exogenously insert ads at different times within videos. We find that ads placed in emotionally similar contexts increase ad engagement (lower skip rates, higher brand recall rates). Ad insertion based on emotional similarity metrics predicted by our theory-based, explainable model produces comparable or better engagement relative to atheoretical models.

What problem does this paper attempt to address?

The core problem this paper addresses is how to build a theory-based, interpretable deep-learning convolutional neural network (CNN) classifier that predicts the emotional response evoked by music over time. Specifically, the research combines music theory with deep-learning techniques so that the model becomes more interpretable without giving up predictive accuracy.

### Main Problem Decomposition

1. **Accuracy of Emotional Prediction**: Existing music emotion classifiers usually fall into two categories: traditional machine-learning methods using handcrafted features (more interpretable but less accurate), and atheoretical deep-learning methods using data-driven features (more accurate but difficult to interpret). This paper proposes a framework that improves interpretability while keeping accuracy comparable to the atheoretical deep-learning models.
2. **Interpretability of the Model**: To make the model more interpretable, the authors introduce CNN filters based on harmonic structure. These filters capture musical features closely related to emotion, such as consonance. The authors also use visualization tools such as Grad-CAM to show how the model forms its emotional predictions from musical features.
3. **Value in Practical Applications**: The research demonstrates the model's value in digital advertising, in particular YouTube mid-roll ads. Placing an ad in an emotionally similar part of the video increases ad engagement (lower skip rates, higher brand recall rates).

### Key Contributions

1. **Combination of Theoretical Basis and Interpretability**: The authors design CNN filters based on harmonic structure, so the model not only captures key features in music but also helps explain how those features drive the emotional prediction.
2. **Balance between Performance and Simplicity**: Although the model is more parsimonious, its predictive performance is comparable to that of atheoretical deep-learning models and better than that of models using handcrafted features.
3. **Verification of Practical Application Scenarios**: A lab experiment verifies the emotion-matching strategy for ad insertion. Insertion based on the model's predicted emotional similarity yields engagement comparable to or better than insertion based on atheoretical models.

### Summary

This paper develops an accurate and interpretable music emotion prediction model by combining music theory with deep-learning techniques, and shows its potential for practical use in digital advertising. The method keeps predictive accuracy comparable to atheoretical deep-learning models while making the model more transparent.
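The idea of a filter tied to harmonic structure can be illustrated with a toy example. The sketch below is not the authors' architecture: it assumes a linear frequency axis and uses a fixed, hypothetical "comb" kernel (`harmonic_comb`) with weights only at integer multiples of an assumed fundamental bin, whereas a CNN would learn its filter weights. It only shows why a harmonically spaced filter responds strongly when the partials of a sound align with it.

```python
import numpy as np

def harmonic_comb(n_bins, f0_bin, n_harmonics):
    """Hypothetical 1-D kernel: ones at integer multiples of f0_bin.

    Assumes a linear frequency axis, so the k-th harmonic of a tone
    whose fundamental falls in bin f0_bin falls in bin k * f0_bin.
    """
    kernel = np.zeros(n_bins)
    for k in range(1, n_harmonics + 1):
        idx = k * f0_bin
        if idx < n_bins:  # ignore harmonics beyond the spectrum
            kernel[idx] = 1.0
    return kernel

def harmonic_energy(spectrum, f0_bin, n_harmonics=4):
    """Response of the comb kernel to one magnitude-spectrum frame."""
    kernel = harmonic_comb(len(spectrum), f0_bin, n_harmonics)
    return float(np.dot(kernel, spectrum))

# Toy magnitude spectrum with partials at bins 10, 20 and 30.
spectrum = np.zeros(64)
spectrum[[10, 20, 30]] = [1.0, 0.5, 0.25]

aligned = harmonic_energy(spectrum, f0_bin=10)    # comb hits every partial
misaligned = harmonic_energy(spectrum, f0_bin=7)  # comb misses every partial
print(aligned, misaligned)
```

Because each nonzero weight in such a kernel corresponds to a specific harmonic, its response is easier to interpret than that of an unconstrained filter, which is the intuition behind the explainability claim above.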

A Theory-Based Explainable Deep Learning Architecture for Music Emotion

A Theory-Based Interpretable Deep Learning Architecture for Music Emotion

BLNN: a muscular and tall architecture for emotion prediction in music

Music Emotion Prediction Using Recurrent Neural Networks

Modeling of the Latent Embedding of Music using Deep Neural Network

Deep learning-based late fusion of multimodal information for emotion classification of music video

Research on Music Emotional Expression Based on Reinforcement Learning and Multimodal Information

Classifying Emotions in Film Music—A Deep Learning Approach

Music emotion recognition based on temporal convolutional attention network using EEG

Music content personalized recommendation system based on a convolutional neural network

Music Emotion Recognition Based on a Neural Network with an Inception-GRU Residual Structure

Music recommendation based on affective image content analysis

Design of Deep Learning Network Model for Personalized Music Emotional Recommendation

A Deep Bidirectional Long Short-Term Memory Based Multi-Scale Approach for Music Dynamic Emotion Prediction

Toward end-to-end interpretable convolutional neural networks for waveform signals

Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System

Hybrid Deep Learning Approach to Emotion-Infused Music Recommendation

Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities

A Deep Neural Network for Modeling Music

Music aesthetic teaching and emotional visualization under emotional teaching theory and deep learning

Music emotion recognition using deep convolutional neural networks