A novel transformer autoencoder for multi-modal emotion recognition with incomplete data

Cheng Cheng, Zhaoxin Fan, Lin Feng, Ziyu Jia
DOI: https://doi.org/10.1016/j.neunet.2024.106111
IF: 7.8
2024-01-08
Neural Networks
Abstract: Multi-modal signals have become essential data for emotion recognition since they can represent emotions more comprehensively. However, in real-world environments, it is often impossible to acquire complete data on multi-modal signals, and the problem of missing modalities causes severe performance degradation in emotion recognition. Therefore, this paper represents the first attempt to use a transformer-based architecture, aiming to fill the modality-incomplete data from partially observed data for multi-modal emotion recognition (MER). Concretely, this paper proposes a novel unified model called transformer autoencoder (TAE), comprising a modality-specific hybrid transformer encoder, an inter-modality transformer encoder, and a convolutional decoder. The modality-specific hybrid transformer encoder bridges a convolutional encoder and a transformer encoder, allowing the encoder to learn local and global context information within each particular modality. The inter-modality transformer encoder builds and aligns global cross-modal correlations and models long-range contextual information with different modalities. The convolutional decoder decodes the encoding features to produce more precise recognition. Besides, a regularization term is introduced into the convolutional decoder to force the decoder to fully leverage the complete and incomplete data for emotional recognition of missing data. 96.33%, 95.64%, and 92.69% accuracies are attained on the available data of the DEAP and SEED-IV datasets, and 93.25%, 92.23%, and 81.76% accuracies are obtained on the missing data. Particularly, the model acquires a 5.61% advantage with 70% missing data, demonstrating that the model outperforms some state-of-the-art approaches in incomplete multi-modal learning.
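To make the role of an inter-modality encoder more concrete, the following is a minimal, dependency-free sketch of dot-product attention across one feature vector per modality. It is not the authors' implementation: the function name `cross_modal_attention`, the single head without learned projections, and the choice to exclude missing modalities from the attention keys are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(tokens, available):
    """Single-head dot-product attention over one token per modality.

    tokens:    list of equal-length feature vectors, one per modality
    available: list of bools; False marks a missing modality (assumption:
               missing modalities are excluded from the keys and values)
    Returns one fused vector per modality, built only from available ones.
    """
    dim = len(tokens[0])
    keys = [i for i, ok in enumerate(available) if ok]
    if not keys:
        raise ValueError("at least one modality must be available")

    fused = []
    for query in tokens:
        # Scaled similarity between this modality and each available one.
        scores = [
            sum(q * k for q, k in zip(query, tokens[i])) / math.sqrt(dim)
            for i in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the available modalities' features.
        fused.append(
            [sum(w * tokens[i][d] for w, i in zip(weights, keys)) for d in range(dim)]
        )
    return fused


# Hypothetical 2-dimensional features for three modalities; the third is missing
# and represented by a zero placeholder.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
fused = cross_modal_attention(tokens, [True, True, False])
```

With a zero placeholder as the query, all similarity scores are equal, so the fused vector for the missing modality is simply the mean of the available modalities' features. A trained model would replace this with learned query, key, and value projections.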
computer science, artificial intelligence, neurosciences
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper primarily addresses the issue of performance degradation in multimodal emotion recognition (MER) due to missing data. Specifically, it proposes a new model called the "Transformer Autoencoder" (TAE), which aims to utilize partially observable data to fill in the missing data in multimodal emotion recognition.

#### Research Background

- **Importance of Multimodal Signals**: Multimodal signals (such as physiological and non-physiological signals) can more comprehensively represent emotional information.
- **Real-world Challenges**: In real environments, due to various factors (such as equipment failure, occlusion, etc.), multimodal signals often experience data loss, leading to significant performance degradation in emotion recognition.

#### Main Contributions

1. **Unified Model**: A unified deep learning framework called the Transformer Autoencoder (TAE) is proposed to handle incomplete multimodal data. This is the first attempt to combine EEG signals with other non-physiological signals to address the issue of missing data in multimodal datasets.
2. **Multimodal Feature Extraction**: The TAE model includes modality-specific hybrid transformer encoders and cross-modal transformer encoders to build long-range dependencies between different modalities.
3. **Regularization Term**: A regularization term is introduced to enable the encoder and decoder to learn more discriminative features during training, thereby improving the classification performance of incomplete data.

Through these methods, the TAE model can effectively capture the intrinsic relationships between missing and available data, thereby enhancing the performance of multimodal emotion recognition. Experimental results show that the model exhibits excellent performance under various degrees of data loss on the DEAP and SEED-IV datasets.
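For readers who want to reproduce an evaluation "under various degrees of data loss", the sketch below shows one plausible way to build availability masks for a given missing rate. The protocol is an assumption, not taken from the paper: here a fixed fraction of samples each lose exactly one randomly chosen modality, and the helper name `simulate_missing` and the modality labels `"eeg"` and `"eye"` are hypothetical.

```python
import random


def simulate_missing(num_samples, modalities, missing_rate, seed=0):
    """Return one availability mask (dict of modality -> bool) per sample.

    Assumed protocol: round(num_samples * missing_rate) samples are marked
    incomplete, and each incomplete sample drops exactly one modality, so
    every sample keeps at least one observed modality.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    if len(modalities) < 2:
        raise ValueError("need at least two modalities to drop one")

    rng = random.Random(seed)  # local generator keeps the split reproducible
    num_incomplete = round(num_samples * missing_rate)
    incomplete = set(rng.sample(range(num_samples), num_incomplete))

    masks = []
    for index in range(num_samples):
        mask = {name: True for name in modalities}
        if index in incomplete:
            mask[rng.choice(modalities)] = False
        masks.append(mask)
    return masks


# Hypothetical example: 10 samples, two modalities, 70% missing rate.
masks = simulate_missing(10, ["eeg", "eye"], 0.7)
incomplete_count = sum(1 for mask in masks if not all(mask.values()))
```

Masks like these can then be used to zero out or withhold the dropped modality before the data reaches a model, which keeps the comparison across missing rates reproducible through the `seed` argument.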