Abstract: Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. A system that can accurately and stably recognize emotions from human language can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for accurate recognition, and they do not effectively model the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that uses both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors, one each for the text and audio modalities, and an autoencoder-based feature fusion module. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of the time-frequency domain relationship from the original speech signal. Because text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and an attention mechanism to model its intra-modal dynamics. The high-level text and audio features are then sent to the autoencoder in parallel to learn a shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. On Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Multimodal EmotionLines Dataset (MELD), our method outperforms existing methods, and its results on CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, experimental results show that joint multimodal information (audio and text) improves overall performance compared to unimodal information, and that autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
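The fusion stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no layer sizes, activations, or loss terms, so the single-layer encoder and decoder, the tanh activation, the joint loss (cross-entropy plus weighted reconstruction error), and all names (`FusionAutoencoder`, `attention_pool`, `joint_loss`, `lam`) are assumptions made for illustration. The TGFE and BLSTM extractors are not reproduced; `attention_pool` only stands in for the step that turns a sequence of hidden states into one high-level feature vector.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def attention_pool(frames, w):
    """Weight each time step by softmax(frame . w) and sum over time.

    Hypothetical stand-in for attention over BLSTM hidden states.
    """
    alpha = softmax([sum(a * b for a, b in zip(f, w)) for f in frames])
    dim = len(frames[0])
    return [sum(al * f[i] for al, f in zip(alpha, frames)) for i in range(dim)]


class FusionAutoencoder:
    """Assumed single-layer autoencoder over concatenated audio and text features."""

    def __init__(self, d_audio, d_text, d_shared, n_classes, seed=0):
        rng = random.Random(seed)
        d_in = d_audio + d_text

        def init(rows, cols):
            return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W_enc = init(d_shared, d_in)       # encoder: input -> shared code
        self.W_dec = init(d_in, d_shared)       # decoder: shared code -> input
        self.W_cls = init(n_classes, d_shared)  # classifier on the shared code

    def forward(self, audio_feat, text_feat):
        x = list(audio_feat) + list(text_feat)             # both modalities enter together
        z = [math.tanh(a) for a in matvec(self.W_enc, x)]  # shared representation
        x_hat = matvec(self.W_dec, z)                      # reconstruction of both modalities
        logits = matvec(self.W_cls, z)                     # emotion scores from the shared code
        return x, z, x_hat, logits


def joint_loss(x, x_hat, logits, label, lam=1.0):
    """Assumed objective: cross-entropy plus lam times mean squared reconstruction error."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    ce = -math.log(softmax(logits)[label])
    return ce + lam * mse
```

In this sketch the classifier reads only the shared code `z`, which is what distinguishes autoencoder-based fusion from classifying the concatenated vector `x` directly; the reconstruction term pushes `z` to retain information from both modalities.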

Dual Memory Fusion for Multimodal Speech Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Multimodal transformer augmented fusion for speech emotion recognition

WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition

Multi-head attention fusion networks for multi-modal speech emotion recognition

Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

Emotion Recognition Model Based on Multimodal Decision Fusion

Multimodal Transformer Fusion for Continuous Emotion Recognition

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Multimodal Emotion Recognition Based on Deep Temporal Features Using Cross-Modal Transformer and Self-Attention

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer

Memory based fusion for multi-modal deep learning

Double Multi-Head Attention Multimodal System for Odyssey 2024 Speech Emotion Recognition Challenge

Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models

An autoencoder-based feature level fusion for speech emotion recognition

Multilevel Transformer For Multimodal Emotion Recognition