Abstract: In real-world human–computer interaction, the performance of multimodal emotion recognition models is inevitably degraded when modality features go missing at random. Robust multimodal emotion recognition methods have therefore attracted increasing attention. However, existing robust methods generally ignore the distributional gap between modalities. To address this issue, we propose a Twin Disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction (TDTN-HLFR), which aims to enhance model robustness by reconstructing missing features from both modality-common and modality-specific perspectives. The network comprises two primary learning processes: disentanglement learning and reconstruction learning. Disentanglement learning uses a Disentanglement Transformer Network (DTN) to efficiently decouple the original multimodal features into modality-specific and modality-common representations. Building on the DTN, reconstruction learning trains the TDTN-HLFR to reconstruct missing features from the modality-common and modality-specific perspectives, which mitigates the impact of the distributional gap on the reconstruction. Extensive experiments are conducted on two multimodal continuous emotion recognition datasets: the Remote Collaborative and Affective (RECOLA) dataset and the Ulm-Trier Social Stress Test (ULM-TSST) dataset. In terms of the combined Concordance Correlation Coefficient (CCC) for valence and arousal prediction, our method improves on the best baseline by 0.0852 (absolute) on RECOLA and 0.0207 on ULM-TSST in the complete modality feature setting, and by 0.0692 on RECOLA and 0.0347 on ULM-TSST in the incomplete modality feature setting. These results demonstrate the potential of the TDTN-HLFR in real-world human–computer interaction scenarios.
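
For readers unfamiliar with the metric reported above, the following is a minimal sketch of the standard (Lin's) Concordance Correlation Coefficient for one emotion dimension, such as valence or arousal. The function name `concordance_cc` and the toy inputs are illustrative assumptions and are not taken from the paper. How the per-dimension scores are combined into the "combined CCC" is not specified in the abstract, so the sketch does not attempt it.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Lin's concordance correlation coefficient, using population statistics.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    Hypothetical helper for illustration only, not the authors' code.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_p = fmean(pred)
    mean_g = fmean(gold)

    # Population (biased) variance and covariance, as in the usual CCC definition.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and equal; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


if __name__ == "__main__":
    # Toy values: predictions shifted by a constant offset from the labels.
    print(concordance_cc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
```

Unlike Pearson correlation, the CCC penalizes a constant offset or scale mismatch between predictions and labels: the shifted toy predictions above are perfectly correlated with the labels but score below 1.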

A Deep Spatiotemporal Interaction Network for Multimodal Sentimental Analysis and Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

TensorFormer: A Tensor-Based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection

Deep Emotional Arousal Network for Multimodal Sentiment Analysis and Emotion Recognition

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

Emotion Recognition via Environmental Context and Human Body

A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition

A multimodal shared network with a cross-modal distribution constraint for continuous emotion recognition

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Based on Pre-LN Transformer Interaction

TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis

SMIN: Semi-supervised Multi-modal Interaction Network for Conversational Emotion Recognition

A multi-stage dynamical fusion network for multimodal emotion recognition

DGSNet: Dual Graph Structure Network for Emotion Recognition in Multimodal Conversations

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition