Abstract: As deep learning research progresses, artificial intelligence is being applied across a growing range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state expressed in speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition to address these problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate data scarcity: having been trained on large amounts of unlabeled data, they supply the model with prior knowledge of multimodal information. Furthermore, a mutual transformer (MT) module is introduced to learn multimodal emotional commonality and speaker-related emotional features, improving contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method named the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To evaluate the performance of TDFNet against existing methods, experiments are conducted on the IEMOCAP corpus under three reasonable data splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA under the RA data splitting strategy, improvements of 1.78% WA and 1.17% UA over the previous state-of-the-art method. Benefiting from the attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet achieves significant improvements in multimodal emotion recognition.
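
The abstract reports results as WA and UA without defining them. As a minimal sketch, assuming the convention common in speech emotion recognition work (WA is overall accuracy across all utterances; UA is the mean of per-class recall), the two metrics can be computed as follows. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples.

    Assumed meaning of "WA"; the abstract does not define it.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally.

    Assumed meaning of "UA"; the abstract does not define it.
    """
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, made-up labels: an imbalanced set where the majority class is easy.
y_true = ["ang", "ang", "ang", "ang", "hap", "sad"]
y_pred = ["ang", "ang", "ang", "ang", "hap", "hap"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced data the two values diverge: here the missed minority class lowers UA more than WA, which is why results of this kind are usually reported with both numbers.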

Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition

TFEformer: Temporal Feature Enhanced Transformer for Multivariate Time Series Forecasting

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Multi-Scale Temporal Transformer For Speech Emotion Recognition

Time-Frequency Deep Representation Learning for Speech Emotion Recognition Integrating Self-attention

FC2VR2: Few Critical Cues-aware Voice Relationship Representation for Speech Emotion Recognition with Transformer

Temporal-spatial Representation Learning Transformer for EEG-based Emotion Recognition

Spatial-temporal Transformers for EEG Emotion Recognition

Multimodal transformer augmented fusion for speech emotion recognition

Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Speech Emotion Recognition via an Attentive Time-Frequency Neural Network

Multilevel Transformer For Multimodal Emotion Recognition

Emotion Recognition Using Transformers with Masked Learning

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Emotion recognition using hierarchical spatial-temporal learning transformer from regional to global brain

Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Learning Local to Global Feature Aggregation for Speech Emotion Recognition

Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition

Transformers for EEG-Based Emotion Recognition: A Hierarchical Spatial Information Learning Model

MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers