Abstract: Research on emotion recognition has shown that multi-modal data fusion improves the accuracy and robustness of human emotion recognition over single-modal methods. Despite the promising results of existing methods, effectively fusing data from multiple modalities remains challenging. First, existing works tend to focus on generating a joint representation by fusing multi-modal data, and few consider the specific characteristics of each modality. Second, most methods fail to fully capture the intricate correlations among modalities and often resort to simple combinations of latent features. To address these challenges, we propose a novel fusion network for multi-modal emotion recognition that enhances the efficacy of multi-modal fusion while preserving the distinct characteristics of each modality. Specifically, a dual-stream multi-scale feature encoding (MFE) module is designed to extract emotional information from temporal slices of both electroencephalogram (EEG) and peripheral physiological signals (PPS). A cross-modal global–local feature fusion module (CGFFM) then integrates global and local information from the multi-modal data and assigns a different importance to each modality, so that the fused representation is weighted toward the more important modalities. Meanwhile, a transformer module is employed to further learn modality-specific information. Moreover, we introduce the adaptive collaboration block (ACB), which leverages both modality-specific and cross-modality relations for enhanced integration and feature representation. In extensive experiments on the DEAP and DREAMER multimodal datasets, our model achieves state-of-the-art performance.
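The idea of assigning a different importance to each modality before fusing can be illustrated with a minimal sketch. This is not the CGFFM described above, whose internals the abstract does not specify. It is a generic softmax-weighted sum, and the function names, the example feature vectors, and the importance scores are all hypothetical; in a trained network the scores would be learned rather than supplied by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(features, scores):
    """Fuse per-modality feature vectors with softmax importance weights.

    features: dict of modality name -> list of floats (all the same length)
    scores:   dict of modality name -> unnormalized importance score
              (hypothetical stand-in for a learned quantity)
    Returns the fused vector and the normalized weight of each modality.
    """
    names = sorted(features)
    dim = len(features[names[0]])
    if any(len(features[n]) != dim for n in names):
        raise ValueError("all modality features must have the same length")

    weights = softmax([scores[n] for n in names])
    # Each fused element is the weighted sum of that element across modalities.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy feature vectors standing in for encoded EEG and PPS slices.
eeg = [1.0, 0.0, 2.0]
pps = [0.0, 1.0, 2.0]
fused, weights = fuse_modalities(
    {"eeg": eeg, "pps": pps},
    {"eeg": 1.0, "pps": 0.0},  # EEG is scored as more important here
)
```

Because the weights sum to one, the fused vector stays on the same scale as its inputs, and the modality with the higher score contributes more to every fused element.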
