Abstract: Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. A system that recognizes emotions from human language accurately and stably can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for reliable recognition, and they do not effectively model the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that uses both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors, one each for the text and audio modalities, and an autoencoder-based feature fusion module. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of the time-frequency domain relationship from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and an attention mechanism to model the intra-modal dynamics. The high-level text and audio features are then fed to the autoencoder in parallel to learn a shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Multimodal EmotionLines Dataset (MELD) outperform existing methods, and the results on CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, experimental results show that joint multimodal information (audio and text) improves overall performance compared to unimodal information, and that autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
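The fusion step described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The abstract does not specify layer sizes, the attention form, or the training objective, so everything here is an assumption made for illustration: the names `attention_pool` and `FusionAutoencoder`, the dot-product attention, the single-layer encoder and decoder, and the untrained random weights. The sketch only shows the data flow: per-step text vectors are pooled by attention, joined with an audio feature vector, and compressed into a shared representation that a classifier could consume.

```python
import math
import random


def attention_pool(hidden_states, query):
    """Softmax-weighted sum of per-step vectors (generic dot-product attention)."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights


def init_linear(n_out, n_in, rng):
    """Random weight matrix and zero bias for a dense layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Dense layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class FusionAutoencoder:
    """Hypothetical single-layer autoencoder over joined audio and text features."""

    def __init__(self, audio_dim, text_dim, shared_dim, seed=0):
        rng = random.Random(seed)
        in_dim = audio_dim + text_dim
        self.enc = init_linear(shared_dim, in_dim, rng)
        self.dec = init_linear(in_dim, shared_dim, rng)

    def encode(self, audio_feat, text_feat):
        joint = audio_feat + text_feat  # list concatenation of both modalities
        return [math.tanh(v) for v in linear(joint, *self.enc)]

    def reconstruct(self, shared):
        return linear(shared, *self.dec)

    def reconstruction_loss(self, audio_feat, text_feat):
        """Mean squared error between the joint input and its reconstruction."""
        joint = audio_feat + text_feat
        recon = self.reconstruct(self.encode(audio_feat, text_feat))
        return sum((r - j) ** 2 for r, j in zip(recon, joint)) / len(joint)


# Toy inputs: three text time steps (stand-ins for BLSTM outputs) and one audio vector.
text_steps = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
text_feat, weights = attention_pool(text_steps, query=[1.0, 0.0])
audio_feat = [0.2, -0.1, 0.7]

ae = FusionAutoencoder(audio_dim=3, text_dim=2, shared_dim=2)
shared = ae.encode(audio_feat, text_feat)  # shared representation for a classifier
loss = ae.reconstruction_loss(audio_feat, text_feat)
```

In this sketch, simple feature concatenation corresponds to passing `audio_feat + text_feat` directly to a classifier, while the autoencoder variant passes `shared` instead. Training the encoder and decoder to minimize `reconstruction_loss` is one common way to learn such a bottleneck, though the abstract does not state which objective the authors used.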

Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition

Wav2vec 2.0 and Context Emotional Information Compensation Based Dialogue Speech Emotion Recognition

An autoencoder-based feature level fusion for speech emotion recognition

Speech Emotion Recognition by Combining a Unified First-Order Attention Network with Data Balance

Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition

Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition

Speech Emotion Recognition Based on Convolutional Neural Network with Attention-Based Bidirectional Long Short-Term Memory Network and Multi-Task Learning

Real-time Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion

The Role of Phonetic Units in Speech Emotion Recognition

Speech Emotion Recognition with Complementary Acoustic Representations

Combined CNN LSTM with attention for speech emotion recognition based on feature-level fusion