Abstract: Intangible cultural heritage (ICH) songs convey the lives and stories of different communities and nations through moving melodies and lyrics that are rich in sentiment. Current research on song sentiment analysis relies mainly on lyrics, audio, or a combination of the two. Recent studies have shown that deep spectrum features, extracted from spectrograms generated from the audio, perform well in several speech-based tasks. However, few studies incorporate spectrum features into multimodal sentiment analysis of songs. Hence, we propose combining audio, lyrics, and spectrograms in a tri-modal fusion approach to multimodal sentiment analysis of ICH songs. In addition, existing work does not fully consider the correlations and interactions between modalities. To address this, we propose a multimodal song sentiment analysis model (MSSAM) that includes a strengthened audio features-guided attention (SAFGA) mechanism, which can learn intra- and inter-modal information effectively. First, we obtain strengthened audio features by fusing acoustic and spectrum features. Then, with the help of SAFGA, the strengthened audio features guide the distribution of attention weights over the words in the lyrics, so that the model focuses on sentiment-bearing words that relate to the sentiment of the strengthened audio features, capturing modal interactions and complementary information. We take two world-level ICH items, Jingju and Kunqu, as examples and build sentiment analysis datasets for them. We compare the proposed model with state-of-the-art baselines on the Jingju and Kunqu datasets. Experimental results demonstrate the superiority of our proposed model.
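
The two-step idea described in the abstract (fuse acoustic and spectrum features, then let the fused vector guide attention over lyric words) can be illustrated with a minimal sketch. This is not the SAFGA mechanism itself, whose equations the abstract does not give. The concatenation-based fusion, the projection matrix `proj`, the scaled dot-product scoring, and the names `fuse_audio` and `audio_guided_attention` are all assumptions made for illustration.

```python
import math


def fuse_audio(acoustic, spectrum):
    """Assumed fusion: concatenate acoustic and spectrum feature vectors."""
    return list(acoustic) + list(spectrum)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(word_vectors, audio_vector, proj):
    """Weight lyric word vectors by their similarity to an audio-derived query.

    word_vectors: list of word embeddings, each of length d
    audio_vector: fused audio features of length a
    proj: hypothetical a x d matrix mapping audio features into word space
    Returns (attention weights over words, weighted lyric representation).
    """
    dim = len(word_vectors[0])
    # Project the fused audio vector into the word-embedding space.
    query = [
        sum(a * row[j] for a, row in zip(audio_vector, proj))
        for j in range(dim)
    ]
    # Scaled dot-product score between the audio query and each word.
    scale = math.sqrt(dim)
    scores = [
        sum(q * w for q, w in zip(query, word)) / scale
        for word in word_vectors
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the word vectors.
    context = [
        sum(wt * word[j] for wt, word in zip(weights, word_vectors))
        for j in range(dim)
    ]
    return weights, context


# Toy values, chosen only to exercise the functions.
audio = fuse_audio([1.0, 0.0], [0.0, 1.0])
proj = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
words = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
weights, context = audio_guided_attention(words, audio, proj)
print(weights, context)
```

In this toy setup, words whose embeddings align with the projected audio vector receive larger weights, which is the sense in which the audio features guide attention over the lyrics. A trained model would learn `proj` and the embeddings rather than fixing them by hand.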

Multimodal Sentiment Analysis of Intangible Cultural Heritage Songs with Strengthened Audio Features-Guided Attention
