A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition

Chiqin Li, Lun Xie, Xinheng Wang, Hang Pan, Zhiliang Wang
DOI: https://doi.org/10.1016/j.eswa.2024.125822
IF: 8.5
2024-11-30
Expert Systems with Applications
Abstract: In real-world human–computer interaction, the performance of multimodal emotion recognition models is inevitably degraded when modality features go missing at random. Robust multimodal emotion recognition methods have therefore attracted increasing attention. However, existing robust methods generally ignore the distributional gap between modalities. To address this issue, we propose a Twin Disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction (TDTN-HLFR), which aims to enhance model robustness by reconstructing missing features from modality-common and modality-specific perspectives. The network comprises two primary learning processes: disentanglement learning and reconstruction learning. Disentanglement learning uses a Disentanglement Transformer Network (DTN) to efficiently decouple the original multimodal features into modality-specific and modality-common representations. Building on the DTN, reconstruction learning develops the TDTN-HLFR, which learns to reconstruct missing features from both perspectives and thereby mitigates the impact of the distributional gap on the reconstruction. Extensive experiments are conducted on two multimodal continuous emotion recognition datasets: the Remote Collaborative and Affective (RECOLA) dataset and the Ulm-Trier Social Stress Test (ULM-TSST) dataset. In terms of the combined Concordance Correlation Coefficient (CCC) for valence and arousal prediction, our method improves on the best baseline by 0.0852 (absolute) on RECOLA and 0.0207 on ULM-TSST in the complete modality feature setting, and by 0.0692 on RECOLA and 0.0347 on ULM-TSST in the incomplete modality feature setting. These results demonstrate the potential of the TDTN-HLFR in real-world human–computer interaction scenarios.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
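
The abstract reports its results as Concordance Correlation Coefficient (CCC) values. As a minimal sketch of that metric, the snippet below computes Lin's CCC for one predicted trace against one gold-standard trace, using the standard definition 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name `concordance_cc`, the use of population (biased) variances, and the zero-denominator fallback are assumptions made for illustration. The abstract does not say how the valence and arousal scores are combined, so that step is left out.

```python
def concordance_cc(pred, gold):
    """Lin's concordance correlation coefficient of two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population (biased) variances and covariance; an assumption of this sketch.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both traces are the same constant; CCC is undefined there.
        # Returning 0.0 is a convention chosen for this sketch only.
        return 0.0
    return 2.0 * cov / denom


# Hypothetical toy traces, not data from the paper.
gold = [1.0, 2.0, 3.0]
print(concordance_cc([1.0, 2.0, 3.0], gold))  # identical traces
print(concordance_cc([2.0, 3.0, 4.0], gold))  # same shape, constant offset of 1
```

Unlike Pearson correlation, CCC penalizes a constant offset or scale mismatch between prediction and annotation: the shifted trace in the second call is perfectly correlated with the gold trace yet scores below 1.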