Abstract: In this paper, we propose MMER, a novel Multimodal Multi-task learning approach for Speech Emotion Recognition. MMER leverages a novel multimodal network based on early-fusion and cross-modal self-attention between text and acoustic modalities and solves three novel auxiliary tasks for learning emotion recognition from spoken utterances. In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark. Additionally, we conduct extensive ablation studies and results analysis to prove the effectiveness of our proposed approach.

What problem does this paper attempt to address?

This paper aims to solve several key problems in Speech Emotion Recognition (SER). Specifically:

1. **Multimodal fusion**: Most existing emotion recognition systems are unimodal and rely mainly on acoustic features. However, human emotional expression is multimodal, spanning language, intonation, facial expressions, and more. The paper therefore proposes a multimodal method that combines information from the acoustic and text modalities to capture emotional cues more comprehensively.
2. **Multi-task learning**: Traditional SER methods usually focus only on the emotion classification task and ignore the gains that auxiliary tasks can bring to model performance. The paper introduces three auxiliary tasks: Automatic Speech Recognition (ASR), Supervised Contrastive Learning (SCL), and Augmented Contrastive Learning (ACL). These tasks help the model learn richer representations, which improves the final emotion recognition performance.
3. **Fine-grained interaction**: To better capture fine-grained interactions between modalities, the paper designs a Multimodal Dynamic Fusion Network (MDFN). This network aligns the acoustic and text modalities at a fine-grained level through Cross-Modal Attention (CMA), thereby reducing unimodal bias.
4. **Robustness enhancement**: With Augmented Contrastive Learning (ACL), the model learns speaker-independent features and so remains robust across different speakers. In addition, augmented text is generated through back-translation and augmented audio is generated with Text-to-Speech synthesis (TTS) under zero-shot speaker conditions, which further improves the model's generalization ability.
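The cross-modal attention idea in item 3 can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention in which text token vectors act as queries and acoustic frame vectors act as keys and values. The function names, the toy vectors, and the absence of learned projection matrices are simplifications for illustration, not the paper's MDFN implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: vectors from one modality (here: text tokens).
    keys, values: vectors from the other modality (here: acoustic frames).
    Returns (attended, weights): one fused vector per query, plus the
    attention weights of that query over all frames.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this text token to every acoustic frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the acoustic frames for this token.
        attended.append([sum(wj * v[d] for wj, v in zip(w, values))
                         for d in range(len(values[0]))])
    return attended, weights

# Hypothetical toy inputs: 2 text tokens, 3 acoustic frames, dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_modal_attention(text_tokens, audio_frames, audio_frames)
```

Each row of `attn` is a probability distribution over acoustic frames, so every text token receives an acoustic summary weighted toward the frames it most resembles. A real model would first apply learned query, key, and value projections and typically use several attention heads.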
In conclusion, through the Multimodal Multi-task Learning framework (MMER), this paper addresses the shortcomings of existing SER methods in multimodal fusion, multi-task learning, and robustness, and significantly improves emotion recognition performance.
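As a rough illustration of how auxiliary objectives can be combined with the main task, the sketch below implements a commonly used form of the supervised contrastive loss and a weighted sum of task losses. The summary does not give the paper's exact formulation or loss weights, so the temperature, the weights, the placeholder loss values, and the helper names here are all assumptions; the ACL term is omitted.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-label embeddings together and push other labels apart.

    For each anchor, average the negative log-probability of its
    same-label samples among all other samples in the batch.
    Anchors with no same-label partner are skipped.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak)
                                        for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def multitask_loss(losses, weights):
    """Weighted sum of per-task losses (the weights are hypothetical)."""
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical batch: two "happy" and two "sad" utterance embeddings.
embeddings = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.8, -0.1]]
scl = supervised_contrastive_loss(embeddings, ["happy", "happy", "sad", "sad"])
mismatched = supervised_contrastive_loss(
    embeddings, ["happy", "sad", "happy", "sad"])
# The "ser" and "asr" values are placeholders, not reported results.
total = multitask_loss({"ser": 1.2, "asr": 0.8, "scl": scl},
                       {"ser": 1.0, "asr": 0.1, "scl": 0.1})
```

The contrastive term is small when embeddings that share an emotion label already lie close together and grows when the labels cut across the clusters, which is why `scl` is lower than `mismatched` on this toy batch.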

MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
