Semantic Alignment Network for Multi-modal Emotion Recognition

Mixiao Hou, Zheng Zhang, Chang Liu, Guangming Lu
DOI: https://doi.org/10.1109/tcsvt.2023.3247822
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Modality alignment can maintain the consistency of semantics in multi-modal emotion recognition tasks, ensuring that features from different modalities accurately represent the emotion-related information in an encoding space. However, current alignment models either focus only on the local fusion of different modal representations or lack a process for mining modality-specific information. We design a Semantic Alignment network based on Multi-Spatial learning (SAMS) for multi-modal emotion recognition, which achieves local and global alignment between modalities by using high-level emotion representations of different modalities as supervisory signals. SAMS builds a multi-spatial learning framework for each modality and, within this framework, constructs a self-modal interaction module based on cross-modal semantic learning. SAMS provides two learning spaces for each modality: one to detect the affective information specific to that modality, and the other to learn semantic knowledge from other modalities. The features of these two spaces are then aligned at the temporal and utterance levels by homologous encoding and different target constraints. Based on the alignment characteristics of these two spaces, a self-modal interaction is built to investigate the fusion representation by exploring the global correlation between the alignment features in unimodal multi-spatial learning. In experiments, our proposed model yields consistent improvements on two standard multi-modal benchmarks and outperforms state-of-the-art approaches. The code for SAMS is available at: https://github.com/xiaomi1024/code_SAMS.
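The idea of giving one modality two learning spaces and aligning one of them with another modality at both the temporal and utterance levels can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). Every name here (`project`, `mean_pool`, `temporal_alignment_loss`, `utterance_alignment_loss`, `w_specific`, `w_semantic`) is hypothetical, and the choice of mean squared error for frame-level alignment and cosine distance for utterance-level alignment is an assumption made only for illustration.

```python
import math
from typing import List

Frames = List[List[float]]  # one utterance: T frames, each a D-dimensional vector


def project(frames: Frames, weights: List[List[float]]) -> Frames:
    """Apply the same linear map (rows of `weights`) to every frame."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def mean_pool(frames: Frames) -> List[float]:
    """Collapse T frames into one utterance-level vector by averaging."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def temporal_alignment_loss(a: Frames, b: Frames) -> float:
    """Mean squared error between time-aligned frames (assumed loss)."""
    assert len(a) == len(b), "sketch assumes equal-length sequences"
    total = sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    return total / (len(a) * len(a[0]))


def utterance_alignment_loss(a: Frames, b: Frames) -> float:
    """Cosine distance between mean-pooled vectors (assumed loss)."""
    u, v = mean_pool(a), mean_pool(b)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


# Toy data: a 3-frame, 2-dimensional audio utterance and a text-side target
# standing in for the other modality's high-level representation.
audio = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
text_target = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

# Two hypothetical learning spaces for the audio modality.
w_specific = [[1.0, 0.0], [0.0, 1.0]]  # space for modality-specific cues
w_semantic = [[1.0, 0.0], [0.0, 1.0]]  # space supervised by the other modality

audio_specific = project(audio, w_specific)
audio_semantic = project(audio, w_semantic)

# Only the semantic space is pulled toward the other modality's target.
loss = (temporal_alignment_loss(audio_semantic, text_target)
        + utterance_alignment_loss(audio_semantic, text_target))
```

In this sketch the temporal term compares frames one by one, while the utterance term compares pooled summaries, so the two terms constrain local and global agreement respectively. A real model would learn the projection weights and would need to handle sequences of different lengths.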
engineering, electrical & electronic