Multimodal Multi-View Spectral-Spatial-Temporal Masked Autoencoder for Self-Supervised Emotion Recognition

Pengxuan Gao, Tianyu Liu, Jia-Wen Liu, Bao-Liang Lu, Wei-Long Zheng
DOI: https://doi.org/10.1109/icassp48485.2024.10447194
2024-01-01
Abstract: Emotion recognition is a primary and complex task in emotional intelligence. Because human emotions are complex, multimodal fusion methods can enhance performance by leveraging the complementary properties of different modalities. In this paper, we propose a Multimodal Multi-view Spectral-Spatial-Temporal Masked Autoencoder (Multimodal MV-SSTMA) with self-supervised learning to investigate multimodal emotion recognition based on electroencephalogram (EEG) and eye movement signals. Our experimental process comprises three stages: 1) in the pre-training stage, we employ MV-SSTMA to train feature extractors for EEG and eye movement signals; 2) in the fine-tuning stage, labeled data are fed to the feature extractors, and the extracted features are fused and fine-tuned; 3) in the testing stage, the model is applied to test data to recognize emotions, and the accuracies of the different methods are computed. Our experimental results demonstrate that the multimodal fusion model outperforms the unimodal model on both the SEED-IV and SEED-V datasets. In addition, the proposed model can still effectively recognize emotions with various ratios of missing data. These results underscore the effectiveness of multimodal self-supervised learning and data fusion in emotion recognition.
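The pre-training and fusion ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the masking ratio, the reconstruction loss, or the fusion operator, so the 0.5 ratio, the mean-squared-error loss over masked tokens, and fusion by concatenation are all assumptions. The function names (`random_mask`, `masked_mse`, `fuse_features`) and the toy data are hypothetical.

```python
import random

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of tokens hidden from the encoder (assumed scheme)."""
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))

def masked_mse(original, reconstructed, masked_idx):
    """Mean squared error computed only over the masked tokens."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        for a, b in zip(original[i], reconstructed[i]):
            total += (a - b) ** 2
            count += 1
    return total / count

def fuse_features(eeg_feat, eye_feat):
    """Fuse two modality feature vectors by concatenation (an assumption)."""
    return list(eeg_feat) + list(eye_feat)

rng = random.Random(0)
# Toy signal: 10 tokens with 4 features each, standing in for EEG patches.
tokens = [[float(t + c) for c in range(4)] for t in range(10)]
masked = random_mask(len(tokens), mask_ratio=0.5, rng=rng)

# Placeholder reconstruction: zeros instead of a trained decoder's output.
recon = [[0.0] * 4 for _ in tokens]
loss = masked_mse(tokens, recon, masked)

# Hypothetical per-modality features after fine-tuning.
fused = fuse_features([0.1, 0.2], [0.3])
```

In a real pipeline the zero-filled reconstruction would be replaced by the output of a decoder, and the pre-training objective would drive the loss on masked tokens toward zero. Computing the loss only on masked positions is the usual masked-autoencoder convention; whether MV-SSTMA follows it exactly is not stated in the abstract.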