Abstract: As deep learning research continues to progress, artificial intelligence is gradually empowering various fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state of a speaker during speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition to address these problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate the data scarcity problem: trained on large amounts of unlabeled data, they supply the model with prior knowledge of multimodal information. Furthermore, a mutual transformer (MT) module is introduced to learn multimodal emotional commonality and speaker-related emotional features, improving contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method named the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To evaluate the performance of TDFNet against existing methods, experiments are conducted on the IEMOCAP corpus under three reasonable data splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA under RA data splitting, improvements of 1.78% WA and 1.17% UA over the previous state-of-the-art method. Benefiting from the attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet achieves significant improvements in multimodal emotion recognition.
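
The abstract reports WA and UA without defining them. As a minimal sketch, assuming the convention common in speech emotion recognition (WA is overall accuracy across all samples, UA is the mean of per-class recalls), the two metrics could be computed as follows. The function names and the toy labels are hypothetical illustrations and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Assumed WA: correct predictions divided by the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Assumed UA: mean of per-class recalls, so each class counts equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, made-up labels for illustration only (not IEMOCAP data).
y_true = ["ang", "ang", "ang", "ang", "hap", "hap", "sad", "neu"]
y_pred = ["ang", "ang", "ang", "hap", "hap", "sad", "sad", "neu"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

Under this assumed definition, UA weights small classes as heavily as large ones, so the two metrics diverge when the class distribution is imbalanced.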

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers

End-to-End Multimodal Emotion Recognition Based on Facial Expressions and Remote Photoplethysmography Signals

TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

Multilevel Transformer For Multimodal Emotion Recognition

Multimodal transformer augmented fusion for speech emotion recognition

Transformer-based Multimodal Information Fusion for Facial Expression Analysis

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

A novel transformer autoencoder for multi-modal emotion recognition with incomplete data

Multimodal interaction enhanced representation learning for video emotion recognition

Multi-Attention Module for Dynamic Facial Emotion Recognition

Multimodal Transformer Fusion for Continuous Emotion Recognition

Joint Multimodal Transformer for Emotion Recognition in the Wild

Facial Expression Recognition Based on Multi-modal Features for Videos in the Wild

Residual multimodal Transformer for expression‐EEG fusion continuous emotion recognition

Multi-head attention fusion networks for multi-modal speech emotion recognition

Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking

Token-disentangling Mutual Transformer for multimodal emotion recognition