Abstract: Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. Building a system that can accurately and stably recognize emotions from human language can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for recognition, and they do not effectively model the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors, one each for the text and audio modalities, and an autoencoder-based feature fusion module. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of time-frequency domain relationships from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and an attention mechanism to model intra-modal dynamics. The high-level text and audio features are then sent to the autoencoder in parallel to learn their shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Multimodal EmotionLines Dataset (MELD) outperform existing methods. Additionally, the results on CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, the experimental results show that joint multimodal information (audio and text) improves overall performance compared to unimodal information, and that autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
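
To make the fusion stage described in the abstract concrete, the following is a minimal NumPy sketch of an autoencoder that maps high-level text and audio features to a shared representation used for classification. It is an illustration under stated assumptions, not the authors' implementation: the class name `FusionAutoencoder`, the single-layer encoder and decoder, the tanh activation, the concatenation of the two inputs, the loss weighting `alpha`, and all dimensions are hypothetical choices that the abstract does not specify. The text and audio features are random stand-ins for the outputs of the BLSTM-attention and TGFE extractors, and only the forward pass and loss are shown (no training loop).

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Small random weights and a zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class FusionAutoencoder:
    """Hypothetical autoencoder-based fusion of text and audio features."""

    def __init__(self, text_dim, audio_dim, shared_dim, n_classes):
        in_dim = text_dim + audio_dim
        self.W_enc, self.b_enc = init_layer(in_dim, shared_dim)
        self.W_dec, self.b_dec = init_layer(shared_dim, in_dim)
        self.W_cls, self.b_cls = init_layer(shared_dim, n_classes)

    def forward(self, text_feat, audio_feat):
        # Assumption: the two modality features are concatenated at the input.
        x = np.concatenate([text_feat, audio_feat], axis=-1)
        shared = np.tanh(x @ self.W_enc + self.b_enc)       # shared representation
        recon = shared @ self.W_dec + self.b_dec            # reconstruction of x
        probs = softmax(shared @ self.W_cls + self.b_cls)   # emotion probabilities
        return x, shared, recon, probs

    def loss(self, text_feat, audio_feat, labels, alpha=1.0):
        # Assumption: cross-entropy plus alpha-weighted reconstruction error.
        x, _, recon, probs = self.forward(text_feat, audio_feat)
        recon_loss = np.mean((recon - x) ** 2)
        picked = probs[np.arange(len(labels)), labels]
        ce_loss = -np.mean(np.log(picked + 1e-12))
        return ce_loss + alpha * recon_loss


# Random stand-ins for extractor outputs: a batch of 4 utterances.
text_feat = rng.normal(size=(4, 8))    # placeholder for BLSTM-attention output
audio_feat = rng.normal(size=(4, 6))   # placeholder for TGFE output
labels = np.array([0, 1, 2, 3])        # four hypothetical emotion classes

model = FusionAutoencoder(text_dim=8, audio_dim=6, shared_dim=5, n_classes=4)
_, shared, recon, probs = model.forward(text_feat, audio_feat)
total_loss = model.loss(text_feat, audio_feat, labels)
```

In this sketch the classifier reads only the bottleneck `shared`, so the reconstruction term pushes that bottleneck to retain information from both modalities, which is one plausible reading of how such fusion could differ from classifying a plain concatenation.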

Maximum A Posteriori Based Fusion Method For Speech Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

A Hybrid PNN-GMM classification scheme for speech emotion recognition

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

An autoencoder-based feature level fusion for speech emotion recognition

Ann Based Decision Fusion for Speech Emotion Recognition

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

A Novel Dual-Modal Emotion Recognition Algorithm with Fusing Hybrid Features of Audio Signal and Speech Context

Multi-level attention fusion network assisted by relative entropy alignment for multimodal speech emotion recognition

Speech Emotion Recognition with Emotion-Pair Based Framework Considering Emotion Distribution Information in Dimensional Emotion Space.

Multi-head attention fusion networks for multi-modal speech emotion recognition

Fusion Model for Speech Emotion Recognition with Low Level Descriptor Features

Speech emotion recognition using multimodal feature fusion with machine learning approach

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

A Novel Emotion-Aware Method Based on the Fusion of Textual Description of Speech, Body Movements, and Facial Expressions

Multimodal Emotion Recognition Based on Feature Selection and Extreme Learning Machine in Video Clips.

A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals

Multi-level Speech Emotion Recognition Based on HMM and ANN.

Classifier fusion for speech emotion recognition

Emotion Detection System Based on New Double-Mode Fusion Algorithm

A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition