Abstract: In real-world human–computer interaction, the performance of multimodal emotion recognition models is inevitably degraded when modality features are randomly missing. Robust multimodal emotion recognition methods have therefore attracted increasing attention. However, existing robust methods generally ignore the distributional gap between modalities. To address this issue, we propose a Twin Disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction (TDTN-HLFR), which aims to enhance model robustness by reconstructing missing features from both modality-common and modality-specific perspectives. The network comprises two primary learning processes: disentanglement learning and reconstruction learning. Disentanglement learning uses a Disentanglement Transformer Network (DTN) to efficiently decouple the original multimodal features into modality-specific and modality-common representations. Building on this, reconstruction learning uses the TDTN-HLFR to reconstruct missing features from the modality-common and modality-specific perspectives, which mitigates the impact of the distributional gap on the reconstruction of missing features. Extensive experiments are conducted on two multimodal continuous emotion recognition datasets: the Remote Collaborative and Affective (RECOLA) dataset and the Ulm-Trier Social Stress Test (ULM-TSST) dataset. In terms of the combined Concordance Correlation Coefficient (CCC) for valence and arousal prediction, our method achieves absolute increases of 0.0852 on RECOLA and 0.0207 on ULM-TSST over the best baseline in the complete modality feature setting, and absolute increases of 0.0692 on RECOLA and 0.0347 on ULM-TSST over the best baseline in the incomplete modality feature setting. These results demonstrate the potential of the TDTN-HLFR in real-world human–computer interaction scenarios.
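For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the standard Concordance Correlation Coefficient for a single emotion dimension, assuming population (biased) variance and covariance. The function name `ccc` and the toy inputs are illustrative assumptions, not taken from the paper, and the abstract does not state how the valence and arousal scores are combined, so that step is left out.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance Correlation Coefficient of two equal-length sequences.

    Standard definition (Lin, 1989), using population statistics:
        2 * cov(pred, gold) / (var(pred) + var(gold) + (mean(pred) - mean(gold))^2)
    """
    pred, gold = list(pred), list(gold)
    if len(pred) != len(gold) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical; treating this as
        # perfect agreement is a convention chosen for this sketch.
        return 1.0
    return 2 * cov / denom


# Toy example (made-up numbers): a constant offset lowers CCC even though
# the Pearson correlation of these two sequences is exactly 1.
print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Unlike Pearson correlation, CCC penalizes differences in mean and scale between predictions and annotations, which is why it is commonly used for continuous valence and arousal prediction.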

Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition

Multimodal Transformer Fusion for Continuous Emotion Recognition

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Multi-Label Multimodal Emotion Recognition With Transformer-Based Fusion and Emotion-Level Representation Learning

Multilevel Transformer For Multimodal Emotion Recognition

Multimodal transformer augmented fusion for speech emotion recognition

Joint Multimodal Transformer for Emotion Recognition in the Wild

A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition

Multimodal interaction enhanced representation learning for video emotion recognition

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

Multimodal Transformer with Learnable Frontend and Self Attention for Emotion Recognition

Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis

Multi-head attention fusion networks for multi-modal speech emotion recognition

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Multimodal Sentiment Analysis Based on Transformer and Low-rank Fusion

Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

Multimodal Transformer Learning for Continuous Emotion Recognition

A Unified Transformer-based Network for multimodal Emotion Recognition

Multimodal Transformer Fusion for Emotion Recognition: A Survey