STM-Net Based Spatial-Temporal Multi-Modal Fusion Network for Emotion Recognition

Lina Li, Wenjie Deng, Shengli Liao, Xue Qiang, Yuying Rong, Ying Yang, Shixuan Liu, Yumei Zhang
DOI: https://doi.org/10.1117/12.3034151
2024-01-01
Abstract: Emotion recognition plays a vital role in human-computer interaction. However, traditional approaches that rely on manual feature extraction can achieve high accuracy but generalize poorly; furthermore, emotion recognition from a single modality is unreliable. To address these challenges, a multi-modal emotion recognition model called STM-Net is proposed. Built on convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), it leverages spatial and temporal information from two modalities: electroencephalography (EEG) and eye movement. A CNN-LSTM-based model is designed to learn the spatial-temporal features of emotions in EEG signals. For eye-tracking signals, a corresponding CNN-based model is designed for feature extraction. The features from both modalities are fused and fed into a fully connected network for classification, yielding comprehensive and accurate emotion recognition results. Experimental results on the SEED-IV multimodal dataset demonstrate the effectiveness of the proposed approach, with accuracy reaching up to 99.6%, much higher than that of other similar multimodal emotion recognition models.
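The two-branch design described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. The class name STMNetSketch, all layer sizes, the input shapes (62 EEG channels, 31 eye-movement features, 4 emotion classes) and the use of concatenation as the fusion step are assumptions made for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn as nn


class STMNetSketch(nn.Module):
    """Hypothetical two-branch model: CNN-LSTM for EEG, CNN for eye movement."""

    def __init__(self, eeg_channels=62, eye_features=31, num_classes=4):
        super().__init__()
        # EEG branch, part 1: 1-D convolution along time that mixes the
        # EEG channels (spatial information) into 32 feature maps.
        self.eeg_cnn = nn.Sequential(
            nn.Conv1d(eeg_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # EEG branch, part 2: LSTM over the time axis (temporal information).
        self.eeg_lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        # Eye-movement branch: 1-D convolution over the feature vector,
        # followed by global average pooling to 8 values.
        self.eye_cnn = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Fully connected classifier on the fused features.
        self.classifier = nn.Sequential(
            nn.Linear(64 + 8, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, eeg, eye):
        # eeg: (batch, eeg_channels, time); eye: (batch, eye_features)
        x = self.eeg_cnn(eeg)               # (batch, 32, time)
        x = x.transpose(1, 2)               # (batch, time, 32)
        _, (h_n, _) = self.eeg_lstm(x)      # h_n: (1, batch, 64)
        eeg_feat = h_n[-1]                  # (batch, 64)
        y = self.eye_cnn(eye.unsqueeze(1))  # (batch, 8, 1)
        eye_feat = y.squeeze(-1)            # (batch, 8)
        # Assumed fusion step: simple concatenation of both feature vectors.
        fused = torch.cat([eeg_feat, eye_feat], dim=1)  # (batch, 72)
        return self.classifier(fused)       # class logits, (batch, num_classes)


# Usage with random tensors standing in for real data.
model = STMNetSketch()
logits = model(torch.randn(2, 62, 10), torch.randn(2, 31))
```

Here each sample yields one logit per emotion class. Concatenation is only one possible fusion strategy; the sketch makes no claim about which one the paper uses.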