MMER: Multimodal Multi-task Learning for Speech Emotion Recognition

Sreyan Ghosh, Utkarsh Tyagi, S Ramaneswaran, Harshvardhan Srivastava, Dinesh Manocha
2023-06-04
Abstract: In this paper, we propose MMER, a novel Multimodal Multi-task learning approach for Speech Emotion Recognition. MMER leverages a novel multimodal network based on early-fusion and cross-modal self-attention between text and acoustic modalities and solves three novel auxiliary tasks for learning emotion recognition from spoken utterances. In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark. Additionally, we conduct extensive ablation studies and results analysis to prove the effectiveness of our proposed approach.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses several key problems in Speech Emotion Recognition (SER):

1. **Multimodal fusion**: Most existing emotion recognition systems are unimodal and rely mainly on acoustic features. Human emotional expression, however, is multimodal, spanning language, intonation, facial expressions, and more. The paper therefore proposes a multimodal method that combines information from the acoustic and text modalities to capture emotional cues more comprehensively.
2. **Multi-task learning**: Traditional SER methods usually focus only on the emotion classification task and overlook the gains that auxiliary tasks can bring. The paper introduces three auxiliary tasks: Automatic Speech Recognition (ASR), Supervised Contrastive Learning (SCL), and Augmented Contrastive Learning (ACL). These tasks help the model learn richer representations, which in turn improves the final emotion recognition performance.
3. **Fine-grained interaction**: To better capture fine-grained interactions between modalities, the paper designs a Multimodal Dynamic Fusion Network (MDFN). This network aligns the acoustic and text modalities at a fine-grained level through Cross-Modal Attention (CMA), thereby reducing unimodal bias.
4. **Robustness enhancement**: With Augmented Contrastive Learning (ACL), the model can learn speaker-independent features and so remains robust across different speakers. In addition, augmented text is generated through back-translation, and augmented audio is generated with zero-shot-speaker Text-to-Speech (TTS) synthesis, which further improves the model's ability to generalize.
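
The cross-modal attention idea described above can be illustrated with a minimal sketch in plain Python. It is not the paper's MDFN: it assumes scaled dot-product attention, omits the learned query/key/value projections and multiple heads that a real model would use, and the function name `cross_modal_attention` and the toy dimensions are hypothetical. Each text token (query) attends over all acoustic frames (context) and receives a weighted average of them.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, context):
    """Scaled dot-product attention from one modality onto another.

    queries: list of Tq vectors (e.g. text token embeddings), each of size d
    context: list of Tc vectors (e.g. acoustic frame embeddings), each of size d
    Returns (attended, weights), where attended is Tq x d and weights is Tq x Tc.
    Assumption: no learned projections, single head (illustration only).
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every context vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in context]
        w = softmax(scores)
        # Weighted average of the context vectors.
        out = [sum(wj * c[i] for wj, c in zip(w, context)) for i in range(d)]
        weights.append(w)
        attended.append(out)
    return attended, weights


random.seed(0)
text = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 tokens
audio = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 frames
attended, weights = cross_modal_attention(text, audio)
```

Here every row of `weights` sums to 1 and shows how strongly one text token draws on each acoustic frame; swapping the arguments gives the audio-to-text direction.
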
In conclusion, the Multimodal Multi-task Learning framework (MMER) addresses the shortcomings of existing SER methods in multimodal fusion, multi-task learning, and robustness, and the paper reports state-of-the-art emotion recognition performance on the IEMOCAP benchmark.
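
As an illustration of the Supervised Contrastive Learning auxiliary task mentioned above, the following sketch implements a generic supervised contrastive loss in plain Python. It follows the commonly used formulation in which each anchor is pulled toward same-label samples and pushed away from all others; whether MMER uses exactly this form is not stated in this summary, and the `temperature` value and function names are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    embeddings: list of N vectors; labels: list of N class ids.
    Anchors with no same-label partner in the batch are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log of the softmax denominator, computed stably.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(points, [0, 1, 0, 1])
```

When the labels agree with the geometric clusters, the loss is small; when they cut across the clusters, it is much larger, which is the pressure that groups same-emotion utterances together in embedding space.
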