Abstract: In recent years, speech emotion recognition (SER) has attracted increasing attention because it is a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies focus on multimodal systems that use other modalities, such as text and facial expressions, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts rich audio emotional features and comprehensively fuses emotion-related features from four aspects (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly using all four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task to learn gender information. To further improve emotion recognition accuracy, we use a joint loss function that combines softmax cross-entropy loss and center loss. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset and achieves competitive performance on the MSP-IMPROV dataset.
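
As an illustration of the joint loss mentioned above, the following is a minimal sketch that adds a weighted center loss term to softmax cross-entropy. The abstract does not give the exact formulation, so the weighting factor `lam`, the function names, and the use of fixed class centers are assumptions made for this example only; in practice the centers would typically be updated during training.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, given raw class scores and the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(logits_batch, features_batch, labels, centers, lam=0.5):
    """Batch mean of cross-entropy plus lam times center loss.

    lam is a hypothetical weighting factor; the abstract does not state its value.
    """
    n = len(labels)
    ce = sum(softmax_cross_entropy(z, y) for z, y in zip(logits_batch, labels)) / n
    cl = sum(center_loss(x, centers[y]) for x, y in zip(features_batch, labels)) / n
    return ce + lam * cl


# Toy usage: one sample, two emotion classes, two-dimensional features.
loss = joint_loss(
    logits_batch=[[0.0, 0.0]],
    features_batch=[[1.0, 2.0]],
    labels=[0],
    centers=[[0.0, 0.0], [1.0, 1.0]],
)
```

The cross-entropy term separates the classes, while the center term pulls each feature toward the center of its own class, which is the usual motivation for combining the two.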

Multimodal Emotion Recognition Based on Multilevel Acoustic and Textual Information

Emotion Recognition in Videos via Fusing Multimodal Features

Multimodal Emotion Recognition Based on Feature Fusion

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Multi-modal Emotion Recognition Based on Speech and Image

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multimodal Speech Emotion Recognition Based on Multi-Scale MFCCs and Multi-View Attention Mechanism

Multi-Modal Emotion Recognition Based on Wavelet Transform and BERT-RoBERTa: An Innovative Approach Combining Enhanced BiLSTM and Focus Loss Function

A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition

Emotion Recognition with Multimodal Transformer Fusion Framework Based on Acoustic and Lexical Information

Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips

Multimodal transformer augmented fusion for speech emotion recognition

Multi-modal Emotion Recognition Based on Deep Learning in Speech, Video and Text

Multi-head attention fusion networks for multi-modal speech emotion recognition

Multimodal Emotion Recognition based on the Fusion of EEG Signals and Eye Movement Data

Multimodal Speech Emotion Recognition Using Audio and Text

A Three-stage multimodal emotion recognition network based on text low-rank fusion

An Improved Multimodal Dimension Emotion Recognition Based on Different Fusion Methods

A Novel Dual-Modal Emotion Recognition Algorithm with Fusing Hybrid Features of Audio Signal and Speech Context

Multimodal modelling of human emotion using sound, image and text fusion